JANUARY 1960


THE PSYCHOLOGICAL REVIEW 


A STOCHASTIC MODEL FOR INDIVIDUAL 
CHOICE BEHAVIOR¹


R. J. AUDLEY 
University College, London 


This paper presents a stochastic 
model which is concerned with the 
interrelations of the response variables 
observed in choice situations. The 
model is not a complete theory, be- 
cause it involves no assumptions about 
the relations between stimulus and re- 
sponse variables. However, for given 
stimulus conditions, the parameters 
of the stochastic process do provide a 
convenient summary of many aspects 
of behaviour in a choice situation. 
Furthermore, the most elementary as- 
sumptions about the way in which 
these parameters might vary with 
changed stimulus conditions lead to 
predictions which are in qualitative 
agreement with experimental findings. 
In a sense, therefore, the stochastic 
model can be regarded as a rudimen- 
tary theory of certain aspects of choice 
behaviour. 


Descriptors of Choice Behavior 


A wide variety of experiments require the use of a situation involving a choice between two or more alternatives. There are several variables which may be employed in a descriptive summary of the behavior which appears in these situations. These variables can be of two kinds. Firstly, there are descriptors of the primary response to the situation, and, secondly, there are descriptors of the responses which the S makes to his primary choices. Those of the first kind are most commonly used, and the three principal ones are: (a) Response time: the time taken for a definite choice to be made. (b) Relative response frequency: the proportion of occasions on which a particular choice response is made. (c) The number of vicarious trial and error responses (VTEs): the number of vacillations between the various alternatives before a definite choice occurs. In the second group, where the descriptor is usually a verbal statement by the S, there are such variables as: (a) confidence in the correctness of a given choice and (b) an assessment of the subjective difficulty of the choice task.

¹ The writer is grateful to A. R. Jonckheere for his generous criticisms during the preparation of the manuscript. He and G. C. Drew were also kind enough to comment upon an earlier draft.

Clearly, the extent to which these various descriptors can be employed will depend upon the specific details of an experiment. But, for many choice situations, all three descriptors of the first kind can be employed, and in most studies with human Ss those of the second kind are available as well. This paper will be mainly concerned with the first kind of descriptor, but some suggestions will be advanced which permit those of the second kind also to be included in a unitary stochastic description of choice behavior.


Particular Choice Situations Which Are Considered


It is believed that the underlying 
hypotheses upon which the stochastic 
description is based are applicable to 
most choice situations. However, the 
derivation of a mathematical model 
from these hypotheses which can be 
readily applied to experimental data 
without additional assumptions is 
more conveniently achieved for a cer- 
tain class of situations. This class 
consists of experiments where knowl- 
edge of the outcome or correctness of 
a response is not available to the S 
until after the choice has been made. 
Thus, for example, most ordinary dis- 
junctive reaction time studies are not 
considered because the S in these ex- 
periments can match his response with 
a known requirement. Nevertheless, the class of situations which can be considered is not a trivial one. It includes, among others, (a) discrimination experiments, including most conventional psychophysical procedures; (b) studies of preference and conflict; and (c) investigations of learning in choice situations.

The next section of the paper is 
mainly concerned with the events sup- 
posed to be taking place during a 
single experimental trial. 


The Stochastic Process


The notions upon which the model is based are very simple and involve only two assumptions:

Assumption 1. It is first assumed that, for given stimulus and organismic conditions, there is associated with each possible choice response a single parameter. This parameter determines the probability that in a small interval of time $(t, t + \Delta t)$ there will occur an “implicit” response of the kind with which the parameter is associated.

No specific interpretation is given to the term “implicit response.” It may, in certain circumstances, be taken to be equivalent to the partial response usually classified as a VTE. But there are some situations in which VTEs are not observed and would seem unlikely to be present. In these cases the “implicit” response may be regarded as a tendency to make a given response, or might perhaps be given some physiological interpretation.

The probabilities of the various kinds of “implicit” responses occurring are considered to be independent of one another, so that, for given conditions, implicit responses of each kind appear at random intervals, unaffected by the appearance of other implicit responses. It follows from the first assumption that the distribution of the intervals between successive implicit responses of a given kind is exponential (with mean equal to the reciprocal of the response parameter) and is determined entirely by that parameter (e.g., see Feller, 1950, p. 220).

Assumption 2. It is assumed that 
a final choice response is made when a 
run of K implicit responses of a given 
kind appears, this run being uninter- 
rupted by occurrences of implicit re- 
sponses of other kinds. K may either 
be assumed to take a particular value 
or can be regarded as a further param- 
eter, which can be estimated from ex- 
perimental data. 
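The two assumptions define a process that can be simulated directly. The Python sketch below is an illustrative reading of them, not the author's procedure; the function `simulate_trial`, its arguments and the use of the standard `random` module are assumptions made for the example. Implicit a and b responses are generated as independent Poisson streams with rates `alpha` and `beta`, and a trial ends at the first uninterrupted run of `K` implicit responses of one kind.

```python
import random


def simulate_trial(alpha, beta, K=2, rng=None):
    """Simulate one choice trial under Assumptions 1 and 2 (illustrative sketch).

    Implicit a and b responses form independent Poisson streams with rates
    alpha and beta. Merged, they form a single Poisson stream of rate
    alpha + beta in which each event is an 'a' with probability
    alpha / (alpha + beta). The trial ends when K implicit responses of one
    kind occur without interruption. Returns (choice, latency, sequence).
    """
    if rng is None:
        rng = random.Random()
    rate = alpha + beta
    t = 0.0
    sequence = []
    run_kind, run_length = None, 0
    while True:
        t += rng.expovariate(rate)  # exponential gap to the next implicit response
        kind = "a" if rng.random() < alpha / rate else "b"
        sequence.append(kind)
        run_length = run_length + 1 if kind == run_kind else 1
        run_kind = kind
        if run_length == K:  # K uninterrupted implicit responses of one kind
            return kind.upper(), t, "".join(sequence)


if __name__ == "__main__":
    rng = random.Random(1)
    trials = [simulate_trial(2.0, 1.0, rng=rng) for _ in range(20_000)]
    share_A = sum(choice == "A" for choice, _, _ in trials) / len(trials)
    mean_latency = sum(t for _, t, _ in trials) / len(trials)
    print(f"share of A choices: {share_A:.3f}, mean latency: {mean_latency:.3f}")
```

With `alpha = 2`, `beta = 1` and K = 2, the simulated share of A choices and the mean latency should lie close to 16/21 and 20/21, the values that the closed-form results for K = 2 (Equations 6a and 11) give for these parameters. Setting `K=1` reproduces the single-response case used in earlier work, in which every trial ends at the first implicit response.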

Assumption 1 has been employed 
before. Mueller (1950) has used this 
approach to describe the intervals be- 
tween bar-presses in an operant condi- 
tioning experiment where only one 
response is involved. For the same 
situation, Estes (1950) and Bush & Mosteller (1951) have used an assumption which is very similar, the only difference being that their models used a discontinuous rather than a continuous distribution of responses in time. Christie (1952), in discussing the determination of response probabilities in a discrimination experiment, has used the same assumption for situations where two responses are competing. Finally, the author of the present paper (Audley, 1957, 1958) has previously used the same notions to combine response times and response probabilities in a stochastic description of individual learning behavior. However, in all these examples, it has been assumed that K = 1. Bush and Mosteller (1955), in an analysis of response times in a runway situation, have considered a continuous model with K > 1, but this generalization does not appear to have been previously employed in a situation involving choice.

There are several reasons which can be advanced for assuming that K > 1. Firstly, when K = 1, but not if K > 1, the distributions of response times for all alternatives can be shown to be identical and exponential (e.g., see Audley, 1958). Neither of these properties is in agreement with experimental findings. Secondly, when K > 1, the sequence of “implicit” responses occurring before a final choice is made offers a possible means of including VTEs within the description of choice behavior. Thirdly, classification of the various sequences of “implicit” responses suggests an approach to descriptors of the second kind. For example, “perfect confidence” in a choice might be identified with sequences consisting of “implicit” responses of one kind only.


Derivation of the Stochastic Model 


No further assumptions are required 
in the derivation of the model, which 
can be applied to situations involving 
any number, m, of choices. However, 


in order to keep the exposition as brief 
as possible, consideration in this paper 
will be limited to situations involving 
a choice between only two alterna- 
tives, i.e., m = 2. Furthermore, the 
mathematical problem is relatively 
simple when K = 2, so that only this 
special case will be presented. Re- 
sults for the more general case have 
been derived and will be elaborated 
elsewhere. 

The two-choice situation with K = 2. The two possible responses will be called A and B, and implicit responses of the two kinds will be labelled a and b respectively. Let the parameters associated with the two responses be $\alpha$ and $\beta$. Assumption 1 means that $p(a)$, the probability of an a occurring in a small time interval $(t, t + \Delta t)$, is given by:

$$p(a) = \alpha\,\Delta t \qquad [1a]$$

Similarly

$$p(b) = \beta\,\Delta t \qquad [1b]$$


The probability $p(a \text{ or } b)$ of an implicit response of either kind, but not both, occurring in the small time interval is

$$p(a \text{ or } b) = p(a) + p(b) - 2p(a)p(b) = (\alpha + \beta)\,\Delta t - 2\alpha\beta\,(\Delta t)^2.$$

Hence

$$p(a \text{ or } b) = (\alpha + \beta)\,\Delta t \qquad [1c]$$

if terms of order $(\Delta t)^2$ are ignored. This becomes possible if a transition is made to the continuous case, when the distribution in time of implicit responses follows that of a Poisson process (e.g., see Feller, 1950, p. 220). Therefore the probability, $p(n, t)$, of obtaining n implicit responses in the time interval $(0, t)$ is (e.g., again see Feller, 1950, p. 221):

$$p(n, t) = \frac{[(\alpha + \beta)t]^n}{n!}\, e^{-(\alpha + \beta)t} \qquad [2]$$
In particular, the probability $p(0, t)$ of obtaining no implicit response of either kind in time t is given by:

$$p(0, t) = e^{-(\alpha + \beta)t} \qquad [3]$$


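The probability $P_a$ that the first implicit response to occur is an a is

$$P_a = \int_0^\infty p(0, t)\,\alpha\,dt = \frac{\alpha}{\alpha + \beta} = \text{say, } p \qquad [4a]$$

Similarly, for implicit b responses,

$$P_b = \frac{\beta}{\alpha + \beta} = \text{say, } q = 1 - p \qquad [4b]$$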


Since occurrences of implicit re- 
sponses follow a Poisson process, 
Equations 4a and 4b also give the 
probability that, starting at any given 
moment, the next implicit response to 
occur will be an a or b respectively. 
Therefore, ignoring for the moment questions concerning the time intervals between successive implicit responses, the sequence of events leading to a final choice can be treated as a sequence of independent binomial trials, with the probabilities $P_a$ and $P_b$ of the two types of event given by Equations 4a and 4b.
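A quick numerical check of Equation 4a, offered as a sketch with arbitrary rates: the waiting times to the next a and to the next b are drawn as independent exponentials, and the shorter one decides which implicit response comes first. Because exponential waiting times are memoryless, the same comparison holds from any starting moment, which is what allows successive implicit responses to be treated as independent trials. The names `first_is_a` and `estimate_p` are hypothetical.

```python
import random


def first_is_a(alpha, beta, rng):
    """Return True if the next implicit response is an a.

    Waiting times to the next a and to the next b are independent exponentials
    with rates alpha and beta (Assumption 1); the shorter one decides which
    implicit response occurs first.
    """
    return rng.expovariate(alpha) < rng.expovariate(beta)


def estimate_p(alpha, beta, n=50_000, seed=0):
    """Monte Carlo estimate of P_a, to be compared with alpha / (alpha + beta)."""
    rng = random.Random(seed)
    return sum(first_is_a(alpha, beta, rng) for _ in range(n)) / n


if __name__ == "__main__":
    print(estimate_p(3.0, 1.0), 3.0 / (3.0 + 1.0))
```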


The Probability, $P_A$, That the Final Choice is an A Response


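The possible sequences which terminate with the occurrence of an A can be easily classified when K = 2. For they must all be simple alternations between a and b, until two successive a's occur. The early members of this class of sequences are: aa, baa, abaa, babaa, etc. The respective probabilities of these various sequences are clearly: $p^2$, $p^2q$, $p^3q$, $p^3q^2$, etc. The over-all probability, $P_A$, that the final choice is an A, is the sum of this infinite series of sequence probabilities. Thus,

$$P_A = p^2 + p^2q + p^3q + p^3q^2 + \cdots \qquad [5]$$

Whence, simplifying to $P_A = p^2(1 + q)/(1 - pq)$, and substituting for p and q from Equations 4a and 4b,

$$P_A = \frac{\alpha^2(\alpha + 2\beta)}{(\alpha + \beta)[(\alpha + \beta)^2 - \alpha\beta]} \qquad [6a]$$

Similarly

$$P_B = \frac{\beta^2(2\alpha + \beta)}{(\alpha + \beta)[(\alpha + \beta)^2 - \alpha\beta]} \qquad [6b]$$

Equation 6a may be written in the following form:

$$P_A = \frac{\alpha}{\alpha + \beta} \cdot \frac{\alpha(\alpha + 2\beta)}{(\alpha + \beta)^2 - \alpha\beta}$$

so that, since $\alpha(\alpha + 2\beta) - [(\alpha + \beta)^2 - \alpha\beta] = \beta(\alpha - \beta)$, when $\alpha > \beta$, $P_A > \alpha/(\alpha + \beta)$ and $P_B < \beta/(\alpha + \beta)$.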

Thus the difference between the 
probabilities of the various implicit 
responses occurring is accentuated in 
the expressions for the probabilities of 
overt choice responses. The accentu- 
ation increases with K and implies 
that there is more certainty in the 
overt choices than in the underlying 
processes which determine them. This 
is believed to be a property which 
many organisms exhibit. 
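Equations 5, 6a and 6b can be checked against one another numerically. The sketch below (function names are illustrative) evaluates the closed form of Equation 6a, obtains 6b by exchanging the parameters, and compares 6a with a long partial sum of the series in Equation 5. It also shows the accentuation just described: for $\alpha = 2$, $\beta = 1$ the value of $P_A$ exceeds $\alpha/(\alpha + \beta)$.

```python
def p_A(alpha, beta):
    """Equation 6a: probability that the final choice is A (K = 2)."""
    s = alpha + beta
    return alpha**2 * (alpha + 2 * beta) / (s * (s**2 - alpha * beta))


def p_B(alpha, beta):
    """Equation 6b: Equation 6a with the roles of alpha and beta exchanged."""
    return p_A(beta, alpha)


def p_A_series(alpha, beta, terms=200):
    """Partial sum of Equation 5: p^2 + p^2 q + p^3 q + p^3 q^2 + ..."""
    p = alpha / (alpha + beta)
    q = 1.0 - p
    total = 0.0
    for n in range(terms):
        # 2n alternations before the final 'aa' (aa, abaa, ...) contribute p^(n+2) q^n;
        # 2n + 1 alternations (baa, babaa, ...) contribute p^(n+2) q^(n+1).
        total += p ** (n + 2) * q**n + p ** (n + 2) * q ** (n + 1)
    return total


if __name__ == "__main__":
    alpha, beta = 2.0, 1.0
    print(p_A(alpha, beta), p_A_series(alpha, beta), alpha / (alpha + beta))
```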


Vicarious Trial and Error 


If we identify alternating appear- 
ances of the “implicit” responses, a 
and b, with VTEs, the moments of the 
distribution of VTEs can readily be 
obtained. Attention here will be confined to the mean number of VTEs preceding (a) any choice and (b) a particular choice.


The Mean Number of VTEs Preceding Any Choice, $\bar{V}$


There are no VTEs if the sequence of implicit responses is aa or bb.

There is 1 VTE if the sequence is baa or abb.

There are 2 VTEs if the sequence is abaa or babb, and so on.

Dividing the sequences of implicit responses into those with an odd number and those with an even number of VTEs, the following probabilities are found (letting $P(V = n)$ be the probability of obtaining n VTEs):

$$
\begin{aligned}
P(V = 0) &= p^2 + q^2, & P(V = 1) &= p^2q + pq^2,\\
P(V = 2) &= p^3q + pq^3, & P(V = 3) &= p^3q^2 + p^2q^3,\\
P(V = 4) &= p^4q^2 + p^2q^4, & P(V = 5) &= p^4q^3 + p^3q^4, \quad \text{etc.}
\end{aligned}
$$


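Now

$$\bar{V} = P(V = 1) + 2P(V = 2) + 3P(V = 3) + \cdots$$

and after some algebraic manipulation

$$\bar{V} = \frac{3pq}{1 - pq},$$

and again substituting for p and q from Equations 4a and 4b,

$$\bar{V} = \frac{3\alpha\beta}{(\alpha + \beta)^2 - \alpha\beta} \qquad [7]$$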


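If $\gamma = \beta/\alpha$, then Equation 7 may be rewritten as

$$\bar{V} = \frac{3\gamma}{1 + \gamma + \gamma^2} = \frac{3}{\gamma + 1 + 1/\gamma}.$$

Thus $\bar{V}$ is dependent only on the ratio of $\beta$ to $\alpha$, and, since $\gamma + 1/\gamma$ is smallest at $\gamma = 1$, it becomes a maximum when $\gamma = 1$, i.e., $\alpha = \beta$. Therefore the number of VTEs would be a maximum when $P_A = P_B = \tfrac{1}{2}$.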
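The distribution and mean of the number of VTEs can be evaluated in the same way. In the sketch below (names hypothetical), `prob_vte` encodes the general terms of the probabilities listed above, $P(V = 2k) = p^{k+2}q^k + p^k q^{k+2}$ and $P(V = 2k+1) = p^{k+2}q^{k+1} + p^{k+1}q^{k+2}$, and the truncated sum $\sum n\,P(V = n)$ is compared with Equation 7 and with its form in terms of $\gamma$.

```python
def prob_vte(n, p):
    """P(V = n) for K = 2, with q = 1 - p (the probabilities listed above)."""
    q = 1.0 - p
    k = n // 2
    if n % 2 == 0:
        return p ** (k + 2) * q**k + p**k * q ** (k + 2)
    return p ** (k + 2) * q ** (k + 1) + p ** (k + 1) * q ** (k + 2)


def mean_vte(alpha, beta):
    """Equation 7: mean number of VTEs preceding any choice."""
    s = alpha + beta
    return 3 * alpha * beta / (s**2 - alpha * beta)


def mean_vte_from_ratio(gamma):
    """Equation 7 in terms of gamma = beta / alpha alone."""
    return 3 * gamma / (1 + gamma + gamma**2)


if __name__ == "__main__":
    p = 2.0 / 3.0  # corresponds to alpha = 2, beta = 1
    print(sum(n * prob_vte(n, p) for n in range(400)), mean_vte(2.0, 1.0))
    print([round(mean_vte_from_ratio(g), 3) for g in (0.25, 0.5, 1.0, 2.0, 4.0)])
```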


The Mean Number of VTEs Preceding A and B Responses, $\bar{V}_A$ and $\bar{V}_B$


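Separate consideration of the mean number of VTEs preceding an A and a B choice yields the following results:

$$\bar{V}_A = \frac{2\alpha\beta}{(\alpha + \beta)^2 - \alpha\beta} + \frac{\beta}{\alpha + 2\beta} \qquad [8a]$$

$$\bar{V}_B = \frac{2\alpha\beta}{(\alpha + \beta)^2 - \alpha\beta} + \frac{\alpha}{2\alpha + \beta} \qquad [8b]$$

Since $\beta/(\alpha + 2\beta)$ and $\alpha/(2\alpha + \beta)$ may be rewritten as $1/(\alpha/\beta + 2)$ and $1/(\beta/\alpha + 2)$ respectively, it can be seen that on the average there would be fewer VTEs preceding the response which is dominant at any given moment, i.e., if $P_A > P_B$, $\bar{V}_A < \bar{V}_B$.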
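A sketch of Equations 8a and 8b (function names illustrative). One useful check is that the average of $\bar{V}_A$ and $\bar{V}_B$, weighted by $P_A$ and $P_B$, must reproduce $\bar{V}$ of Equation 7; `p_A` repeats Equation 6a so that the block stands alone.

```python
def p_A(alpha, beta):
    """Equation 6a, repeated so that this sketch is self-contained."""
    s = alpha + beta
    return alpha**2 * (alpha + 2 * beta) / (s * (s**2 - alpha * beta))


def mean_vte_A(alpha, beta):
    """Equation 8a: mean number of VTEs preceding an A choice."""
    s = alpha + beta
    return 2 * alpha * beta / (s**2 - alpha * beta) + beta / (alpha + 2 * beta)


def mean_vte_B(alpha, beta):
    """Equation 8b: Equation 8a with alpha and beta exchanged."""
    return mean_vte_A(beta, alpha)


if __name__ == "__main__":
    a, b = 2.0, 1.0
    # p_A(b, a) is P_B (Equation 6b is 6a with the parameters exchanged).
    weighted = p_A(a, b) * mean_vte_A(a, b) + p_A(b, a) * mean_vte_B(a, b)
    print(mean_vte_A(a, b), mean_vte_B(a, b), weighted)
```

For $\alpha = 2$, $\beta = 1$ the weighted average equals 6/7, the value of Equation 7, and $\bar{V}_A < \bar{V}_B$ as the text states.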


The Time Distribution of Final Choice 


It is possible to determine all the moments of the time distribution of final responses. Here, however, consideration will be limited to the mean latency, L, of all responses and the mean latencies for A and B responses taken separately, $L_A$ and $L_B$ respectively.


The Mean Latencies for A and B Responses, $L_A$ and $L_B$

Let $P(a, t)$ be the probability that, at time t, no two consecutive a's or b's have appeared, and that the last implicit response was an a. Let $P(a, t; n)$ be the probability that, at time t, no two consecutive a's or b's have appeared, that the last implicit response was an a, and also that there have been exactly n implicit responses. Thus

$$P(a, t) = \sum_{n=1}^{\infty} P(a, t; n).$$
To determine $P(a, t; n)$, Equation 2 and the method employed to find $P_A$ are combined.

Let $P(a; n)$ be the probability that a sequence of n events ends with an a, no two consecutive a's or b's having occurred. Clearly,

$$P(a; 1) = \frac{\alpha}{\alpha + \beta}, \qquad P(a; 2) = \frac{\alpha\beta}{(\alpha + \beta)^2}, \qquad P(a; 3) = \frac{\alpha^2\beta}{(\alpha + \beta)^3}, \quad \text{etc.},$$

these probabilities being respectively associated with the sequences a, ba, aba, etc.

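Now $P(a, t; n) = p(n, t) \cdot P(a; n)$, and Equation 2 gives $p(n, t)$, so that

$$P(a, t; 1) = p(1, t) \cdot P(a; 1) = \alpha t\, e^{-(\alpha + \beta)t},$$

$$P(a, t; 2) = \frac{\alpha\beta t^2}{2!}\, e^{-(\alpha + \beta)t}.$$

Similarly

$$P(a, t; 3) = \frac{\alpha^2\beta t^3}{3!}\, e^{-(\alpha + \beta)t}, \quad \text{etc.}$$

Hence

$$P(a, t) = \sum_{n=1}^{\infty} P(a, t; n) = e^{-(\alpha + \beta)t}\left[\alpha t + \frac{\alpha\beta t^2}{2!} + \frac{\alpha^2\beta t^3}{3!} + \frac{\alpha^2\beta^2 t^4}{4!} + \cdots\right],$$

which, upon simplification, gives

$$P(a, t) = e^{-(\alpha + \beta)t}\left[\sqrt{\frac{\alpha}{\beta}}\,\sinh\!\left(\sqrt{\alpha\beta}\,t\right) + \cosh\!\left(\sqrt{\alpha\beta}\,t\right) - 1\right] \qquad [9a]$$

Similarly it may be determined that

$$P(b, t) = e^{-(\alpha + \beta)t}\left[\sqrt{\frac{\beta}{\alpha}}\,\sinh\!\left(\sqrt{\alpha\beta}\,t\right) + \cosh\!\left(\sqrt{\alpha\beta}\,t\right) - 1\right] \qquad [9b]$$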


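Now, an A response occurs at time t when an implicit a arrives while the process is in the state described by $P(a, t)$, so the density of A responses is $\alpha P(a, t)$, whose integral over all t is $P_A$. Hence

$$L_A = \frac{\int_0^\infty t\,P(a, t)\,\alpha\,dt}{\int_0^\infty P(a, t)\,\alpha\,dt} = \frac{2(\alpha + \beta)}{(\alpha + \beta)^2 - \alpha\beta} + \frac{\beta}{(\alpha + \beta)(\alpha + 2\beta)} \qquad [10a]$$

and similarly

$$L_B = \frac{2(\alpha + \beta)}{(\alpha + \beta)^2 - \alpha\beta} + \frac{\alpha}{(\alpha + \beta)(2\alpha + \beta)} \qquad [10b]$$

By the same kind of argument it may be demonstrated that the mean latency for all responses, L, is given by

$$L = \frac{2}{\alpha + \beta} + \frac{3\alpha\beta}{(\alpha + \beta)[(\alpha + \beta)^2 - \alpha\beta]} \qquad [11]$$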
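Returning to Equations 10a and 10b, it can be seen that the terms $\beta/[(\alpha + \beta)(\alpha + 2\beta)]$ and $\alpha/[(\alpha + \beta)(2\alpha + \beta)]$ may be written as

$$\frac{1}{(\alpha + \beta)\left(\frac{\alpha}{\beta} + 2\right)} \quad\text{and}\quad \frac{1}{(\alpha + \beta)\left(\frac{\beta}{\alpha} + 2\right)}$$

for A and B responses respectively. Thus the dominant response will, on the average, have a shorter choice time than the other, i.e., if $P_A > P_B$, $L_A < L_B$.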
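Equations 10a, 10b and 11 in code, as an illustrative sketch with hypothetical function names. Conditional on the sequence of implicit responses, the intervals between them are independent exponentials with mean $1/(\alpha + \beta)$, so each mean latency is the corresponding mean number of implicit responses divided by $\alpha + \beta$. Two consequences are used as checks below: $P_A L_A + P_B L_B$ must equal L, and multiplying both parameters by a constant divides every latency by that constant.

```python
def p_A(alpha, beta):
    """Equation 6a, repeated so that this sketch is self-contained."""
    s = alpha + beta
    return alpha**2 * (alpha + 2 * beta) / (s * (s**2 - alpha * beta))


def mean_latency_A(alpha, beta):
    """Equation 10a: mean latency of A responses."""
    s = alpha + beta
    return 2 * s / (s**2 - alpha * beta) + beta / (s * (alpha + 2 * beta))


def mean_latency_B(alpha, beta):
    """Equation 10b: Equation 10a with alpha and beta exchanged."""
    return mean_latency_A(beta, alpha)


def mean_latency(alpha, beta):
    """Equation 11: mean latency of all responses."""
    s = alpha + beta
    return 2 / s + 3 * alpha * beta / (s * (s**2 - alpha * beta))


if __name__ == "__main__":
    a, b = 2.0, 1.0
    print(mean_latency_A(a, b), mean_latency_B(a, b), mean_latency(a, b))
```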

In order to compare the theoretical response time distribution to observed data, the probability $P(0, t)$ of no final response having occurred by time t is also given. This is clearly

$$P(0, t) = p(0, t) + P(a, t) + P(b, t).$$

$p(0, t)$ is given by Equation 3, and $P(a, t)$ and $P(b, t)$ by Equations 9a and 9b, so that, upon some simplification,

$$P(0, t) = e^{-(\alpha + \beta)t}\left[2\cosh\!\left(\sqrt{\alpha\beta}\,t\right) + \frac{\alpha + \beta}{\sqrt{\alpha\beta}}\,\sinh\!\left(\sqrt{\alpha\beta}\,t\right) - 1\right] \qquad [12]$$
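Because Equation 12 rests on an infinite sum that is easy to get wrong, the sketch below compares the closed form with a direct term-by-term sum built from Equation 2 and the probabilities $P(a; n)$ and $P(b; n)$ of the single alternating sequence of each length. It also integrates $P(0, t)$ numerically: for a non-negative latency the area under the survival function equals the mean, so the integral should agree with L of Equation 11. The function names, the integration limit and the step count are arbitrary choices made for the example.

```python
import math


def prob_no_choice(t, alpha, beta):
    """Equation 12 as reconstructed here: P(0, t) for K = 2."""
    s = alpha + beta
    r = math.sqrt(alpha * beta)
    return math.exp(-s * t) * (2 * math.cosh(r * t) + (s / r) * math.sinh(r * t) - 1)


def prob_no_choice_series(t, alpha, beta, terms=80):
    """The same quantity summed term by term from Equation 2 and P(a; n), P(b; n)."""
    s = alpha + beta
    p, q = alpha / s, beta / s
    total = 1.0           # n = 0: no implicit response yet (Equation 3, factor e^{-st} applied below)
    poisson_weight = 1.0  # (s t)^n / n!, updated incrementally
    for n in range(1, terms):
        poisson_weight *= s * t / n
        if n % 2 == 1:    # odd n: a, aba, ababa, ... and b, bab, babab, ...
            k = n // 2
            ends_a, ends_b = p ** (k + 1) * q**k, q ** (k + 1) * p**k
        else:             # even n: ba, baba, ... and ab, abab, ...
            ends_a = ends_b = (p * q) ** (n // 2)
        total += poisson_weight * (ends_a + ends_b)
    return math.exp(-s * t) * total


def mean_from_survival(alpha, beta, upper=40.0, steps=40_000):
    """Trapezoidal integral of P(0, t) over t, i.e. the mean latency L."""
    h = upper / steps
    values = [prob_no_choice(i * h, alpha, beta) for i in range(steps + 1)]
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))


if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0, 2.0, 4.0):
        print(t, prob_no_choice(t, 2.0, 1.0), prob_no_choice_series(t, 2.0, 1.0))
    print(mean_from_survival(2.0, 1.0), 20 / 21)
```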


The Model and Descriptors of the Second Kind


At present, it is only possible to ad- 
vance some speculations concerning 
variables such as “degree of confi- 
dence” in the correctness of a given 
choice. Nevertheless, it seems worth 
considering these since there appears 
to be a definite relation between the 
second kind of descriptor and the more 
conventional indices of choice be- 
havior. Henmon (1911), whose paper 
will be considered in more detail 
later, showed that choices regarded 
by an S with confidence are generally 
quicker and more accurate than others. 
This result was demonstrated in a 
psychophysical discrimination situa- 
tion where a definite, correct choice 
existed. 

There seem to be two possible ways 
in which “confidence” might be at- 
tributed to a particular choice. The 
first of these involves some classifica- 
tion of the various sequences of im- 
plicit responses preceding a final 
choice. For example, sequences which involve no vacillation at all, such as aa or bb, might be regarded as “more confident” than sequences involving a large number of vacillations, such as abababaa. It will be shown that this kind of “confident” sequence has the properties required by Henmon's data.

For, suppose A to be the correct and B the incorrect choice in a psychophysical situation; then, generally speaking, one would expect $\alpha > \beta$. The probability of the sequence aa would be $\alpha^2/(\alpha + \beta)^2$ and the probability of bb, $\beta^2/(\alpha + \beta)^2$. Hence the probability, $P_C$, of being correct for this type of confident “choice,” i.e., choosing A, is given by

$$P_C = \frac{\alpha^2}{\alpha^2 + \beta^2} \qquad [13]$$

Comparing this probability with the over-all probability of an A response, $P_A$, given by Equation 6a,

$$P_C - P_A = \frac{\alpha^2\beta^2(\alpha - \beta)}{(\alpha + \beta)(\alpha^2 + \beta^2)[(\alpha + \beta)^2 - \alpha\beta]} \qquad [14]$$

Clearly, Equation 14 is positive when $\alpha > \beta$, and hence $P_C > P_A$.
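Equations 13 and 14 can be verified numerically. The sketch below (names hypothetical) checks that the difference $P_C - P_A$ computed directly agrees with Equation 14 and that its sign follows the sign of $\alpha - \beta$.

```python
def p_A(alpha, beta):
    """Equation 6a: over-all probability of an A response."""
    s = alpha + beta
    return alpha**2 * (alpha + 2 * beta) / (s * (s**2 - alpha * beta))


def p_confident_correct(alpha, beta):
    """Equation 13: probability of choosing A, given that the sequence was aa or bb."""
    return alpha**2 / (alpha**2 + beta**2)


def confidence_advantage(alpha, beta):
    """Equation 14 as reconstructed here: P_C - P_A."""
    s = alpha + beta
    return alpha**2 * beta**2 * (alpha - beta) / (
        s * (alpha**2 + beta**2) * (s**2 - alpha * beta)
    )


if __name__ == "__main__":
    a, b = 2.0, 1.0  # A taken as the correct response, so alpha > beta
    print(p_confident_correct(a, b), p_A(a, b), confidence_advantage(a, b))
```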

Since for these “confident” responses only two implicit responses occur before a final choice, it is clear that their mean response time, $2/(\alpha + \beta)$, is shorter than the over-all average response time. This approach consists essentially in equating “degree of confidence” with some function of the reciprocal of the number of VTEs preceding the final choice.

The second suggested approach to 
judgmental confidence is based upon 
the fact that these appraisals of a re- 
sponse, under normal instructions, fol- 
low after the response itself. Degree 
of confidence, therefore, might be as- 
sociated with implicit responses con- 
tinuing to occur after an overt choice 
response has occurred. If, after an A response has been made, a further a occurs in the time before the statement of confidence is produced, this might be taken to lead to greater confidence than if nothing or a b appeared. Indeed, it might be possible to develop a model for the distribution of the times between making the primary choice response and giving an estimate for degree of confidence from this kind of assumption.

Other approaches to the second kind 
of descriptor are undoubtedly possible 
within the present scheme. The im- 
portant point is that it is possible to 
test these various hypotheses quite 
easily. They each predict how often 
a given level of confidence would be 
employed. Also the expected distri- 
bution of descriptors of the first kind 
associated with each level of confi- 
dence can be determined. 


THE AGREEMENT BETWEEN THE 
PROPERTIES OF THE MODEL 
AND EMPIRICAL DATA 


The principal aim of this paper is 
to show that a set of very simple as- 
sumptions can be used to derive rela- 
tions which might be expected among 
the variables observed in a choice 
situation. In an exposition of this 
kind it is not possible to examine, in 
any detail, the success of the model in 
describing the results of experiments 
which are relevant. For one thing, 
only the particular case arising when 
K = 2 has been presented, whereas in 
practice it may be more profitable to 
treat K as a parameter. Also, the 
argument so far presented is concerned 
with the events supposed to occur at 
a single experimental trial. The 
manner in which the model is applied 
to experimental data based upon a 
number of trials will depend very 
much upon the way in which separate 
trials resemble one another. There may be actual variations in stimulus conditions from trial to trial, or there may be a direct dependence of later upon earlier trials, as in learning experiments. For this reason, consideration of quantitative evidence will be mainly confined to an experiment by Henmon (1911), in which the conditions under which individual trials were conducted closely resemble one another and where it can reasonably be assumed that there are no systematic changes in an S's behavior. These data can therefore be regarded as appropriate for testing the model without there being any need to make further special assumptions. However, before examining Henmon's results, it seems worthwhile to exhibit the manner in which the model seems to match empirical evidence about choice behavior in general.

In effecting a general appraisal of 
the model, one is hindered by the 
general lack of individual results in 
the experimental literature. For rea- 
sons which cannot be examined here 
it seems preferable to test hypotheses 
about functional relations upon indi- 
vidual data. A brief argument for this 
point of view has been presented by 
Bakan (1955) and for the study of 
learning behavior by Audley and 
Jonckheere (1956). The reader is re- 
ferred to these papers for further de- 
tails. However, irrespective of the 
stand taken on this question, it is 
clear that the present model is con- 
cerned with individual results and that 
such results are not generally avail- 
able. For this reason, the following 
comparison of the model with experi- 
mental evidence is largely qualitative, 
although, given appropriate data, 
quantitative comparisons would have 
been possible. 


Psychophysical Discrimination Situations


In considering results from psychophysical experiments, say using the constant method, it is necessary to consider separately the comparison of each variable with the standard. This is so because no assumptions have thus far been made about the relation between stimulus and response variables. In spite of this, some general predictions can be made.

Consider the results obtained from the comparison of the standard with a particular variable stimulus. In this comparison, it can be supposed that the responses A and B refer to the respective statements “the variable is greater than the standard” and “the variable is smaller than the standard.” $\alpha$ will clearly be a monotonically increasing function of the magnitude of the variable, and $\beta$ a monotonically decreasing function of the same magnitude. At the PSE, $\alpha = \beta$. Within limits, and certainly for a range of stimuli close to the PSE, $(\alpha + \beta)$ can be assumed to be approximately constant. This supposition is not crucial, but simplifies the ensuing argument.


Relation of Judgment Time to the Perceived Distance between Stimuli


Equation 11 gives the mean choice time as a function of $\alpha$ and $\beta$. This can be rewritten in the following way:

$$L = \frac{1}{\alpha + \beta}\left[2 + \frac{3\alpha\beta}{(\alpha + \beta)^2 - \alpha\beta}\right] \qquad [15]$$

If $(\alpha + \beta)$ is approximately constant, L will depend principally upon the product of the parameters, $\alpha\beta$, and will increase with it. Since, for a fixed sum, the product $\alpha\beta$ is largest when $\alpha = \beta$, L will have a maximum when $\alpha = \beta$. From Equation 6a it can be seen that the point $\alpha = \beta$ also defines the PSE, since for these parameter values $P_A = P_B = 0.5$. Decision time will therefore rise monotonically up to the PSE and then decrease monotonically beyond the PSE. For the range and distribution of stimuli employed in most psychophysical studies, the decrease in decision time upon either side of the PSE will be, according to the model, approximately symmetrical. These properties are in agreement with empirical data, as for example summarized by Guilford (1954).
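To illustrate the predicted peak at the PSE, the sketch below holds $\alpha + \beta$ at an arbitrary constant value of 2 (the simplifying supposition made above), steps $\alpha$ across a hypothetical range of variable stimuli, and evaluates Equation 15 together with Equation 6a. The grid and the constant are illustrative choices, not values from any experiment.

```python
def mean_latency(alpha, beta):
    """Equation 15 (an algebraic rearrangement of Equation 11)."""
    s = alpha + beta
    return (2 + 3 * alpha * beta / (s**2 - alpha * beta)) / s


def p_A(alpha, beta):
    """Equation 6a."""
    s = alpha + beta
    return alpha**2 * (alpha + 2 * beta) / (s * (s**2 - alpha * beta))


TOTAL = 2.0                              # hypothetical constant value of alpha + beta
alphas = [i / 10 for i in range(2, 19)]  # alpha = 0.2, 0.3, ..., 1.8
latencies = [mean_latency(a, TOTAL - a) for a in alphas]

if __name__ == "__main__":
    for a, lat in zip(alphas, latencies):
        print(f"alpha={a:.1f}  beta={TOTAL - a:.1f}  P_A={p_A(a, TOTAL - a):.3f}  L={lat:.3f}")
```

With the sum held exactly constant the curve is exactly symmetrical about the PSE; the approximate symmetry described in the text reflects the fact that $\alpha + \beta$ is only approximately constant in practice.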

Even where the S is allowed three categories of response, it is the boundaries between these categories which show peak decision times (Cartwright, 1941). This would be expected if a further parameter were used to characterize “equal” or “doubtful” responses. It would be of great interest to determine whether, in fact, a further response parameter is required when a third response category is permitted. Almost by definition, the response “doubtful” implies that no decision has been reached by a certain time. Such responses would then appear to be best described by the time which the S is willing to spend in attempting to come to a decision. This would make the range of stimuli over which judgments of “doubtful” are made depend only indirectly upon differential sensitivity. The readiness of the S to continue attempting to arrive at a definite answer would also play an important role. This is in accord with the generally accepted view of the use of a third category, e.g., Woodworth (1938), Guilford (1954). On the other hand, a parameter to specify judgments of “equality” may still be required. This would allow for a time-determined “doubtful” judgment of the kind discussed above, but would also introduce a true “equals” category. This would enable an analysis of the third category to be carried out in accordance with the suggestions of Cartwright (1941) and George (1917).


The Relation between Confidence, Decision Time and Perceived Distance between Stimuli

The exact nature of the relations between the variables considered in this section will depend upon whether stimulus conditions are the same for all trials. Nevertheless, some general predictions can be advanced.

Here, “degree of confidence” will be equated with some function of the reciprocal of the number of VTEs preceding a final choice. The number of VTEs can, of course, range from zero to infinity. Generally speaking, confidence is rated upon some scale from zero to unity. Let C be the degree of confidence associated with a given choice, and V the number of VTEs preceding this choice act. Determining a suitable relation between C and V would, in fact, be one of the experimental problems suggested by the present approach. For the moment, however, it will be assumed that

$$C = \frac{1}{1 + V} \qquad [16]$$

so that when $V = 0$, $C = 1$; and when $V = \infty$, $C = 0$.

It will be recalled from the section concerned with VTEs that the number of these will, when K = 2, be two less than the number of implicit responses preceding a final choice. Now it can easily be demonstrated, using Equation 1c (each interval between successive implicit responses having mean $1/(\alpha + \beta)$), that the mean choice time when n implicit responses occur, $T_n$, is given by

$$T_n = \frac{n}{\alpha + \beta} \qquad [17]$$

Whence, since $V = n - 2$, n can be eliminated from Equation 17, and it is possible to express the mean choice time $T_V$ as a function of V, given by

$$T_V = \frac{V + 2}{\alpha + \beta} \qquad [18]$$

Substituting for V from Equation 16, and adding an arbitrary constant, $T_0$, for the minimum choice time possible,

$$T = \frac{1}{\alpha + \beta}\left(\frac{1}{C} + 1\right) + T_0 \qquad [19]$$

This hyperbolic function is in agreement with experimental determinations of the relation between confidence and judgment time, e.g., see again Guilford (1954).
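A minimal sketch of Equations 16 to 19, assuming the provisional relation $C = 1/(1 + V)$ adopted above; the function names and the value used for $T_0$ are hypothetical.

```python
def confidence(v):
    """Equation 16 (the provisional assumption): C = 1 / (1 + V)."""
    return 1.0 / (1.0 + v)


def mean_time_for_vtes(v, alpha, beta):
    """Equation 18: mean choice time when V VTEs occur (n = V + 2 implicit responses)."""
    return (v + 2) / (alpha + beta)


def mean_time_for_confidence(c, alpha, beta, t0=0.0):
    """Equation 19: T = (1/C + 1) / (alpha + beta) + T0, with t0 the arbitrary minimum time."""
    return (1.0 / c + 1.0) / (alpha + beta) + t0


if __name__ == "__main__":
    for c in (1.0, 0.5, 0.25, 0.1):
        print(c, mean_time_for_confidence(c, 2.0, 1.0, t0=0.25))
```

Composing Equations 16 and 19 recovers Equation 18, and the mean time falls hyperbolically as confidence rises.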

If the stimulus conditions are varied between different sets of trials, as for example in the constant method discussed in the previous section, general conclusions are again possible. For in discussing Equation 7, it was shown that the mean number of VTEs depends only upon the ratio of $\alpha$ to $\beta$. Again assuming that $(\alpha + \beta)$ is approximately constant, $\bar{V}$ would be a roughly symmetrical function of the magnitude of the variable, having a maximum at the PSE. Thus the average degree of confidence, $\bar{C}$, would be a roughly U-shaped function having a minimum at the PSE. Since choice time has been shown to have a maximum at the PSE and to decrease upon either side of this point, $\bar{C}$ and mean choice time would again vary inversely. This agrees with experimental data (see Guilford, 1954).


Preference and Conflict Situations 


In this kind of situation, a number of objects are paired and the subject makes a choice indicating the preferred object of each pair. For any given pair of objects, say A and B, the parameters $\alpha$ and $\beta$ can be taken to represent some measure of preference for A and B. Because there are a number of objects, it is more convenient to label the r objects presented to the subject as $X_i$, and to let the parameter associated with a kind of “absolute preference” for each be $\alpha_i$ ($i = 1, 2, \ldots, r$). The $\alpha$ and $\beta$ of the equations will now be replaced by, say, $\alpha_j$ and $\alpha_k$, for the comparison of the jth and kth objects, $X_j$ and $X_k$. This, of course, is to make the very strong assumption that the $\alpha_i$'s are independent of the particular comparison in which they are involved. This assumption could be readily tested by using the model appropriately, and is accepted here only in order to simplify notation. The results of the following argument would be qualitatively the same, even if there were, in fact, contextual effects peculiar to each comparison.

Variation in choice time among different comparisons. The set of r objects, on the basis of a paired comparison technique, can usually be ranked. Let i be an individual's ranking of an object, so that we may write

$$X_1 > X_2 > \cdots > X_i > X_{i+1} > \cdots > X_r,$$

meaning $X_1$ is preferred to $X_2$, and so on. This means that $\alpha_1 > \alpha_2 > \cdots > \alpha_i > \alpha_{i+1} > \cdots > \alpha_r$. Consider any pair of parameters, say $\alpha_j$ and $\alpha_k$, and let these be the $\alpha$ and $\beta$ of the earlier equations. Then the mean choice time is given by Equation 11, and this can now be rewritten as

$$L_{(j,k)} = \frac{2}{\alpha_j + \alpha_k} + \frac{3\alpha_j\alpha_k}{(\alpha_j + \alpha_k)[(\alpha_j + \alpha_k)^2 - \alpha_j\alpha_k]} \qquad [20]$$

Clearly $L_{(j,k)}$ depends upon two things: firstly, the sum of the parameters, $(\alpha_j + \alpha_k)$, and, secondly, the product of the parameters, $\alpha_j\alpha_k$. Other things being equal, the choice time will decrease as $(\alpha_j + \alpha_k)$ increases. Again, with $(\alpha_j + \alpha_k)$ constant, $L_{(j,k)}$ will increase with the product, reaching a maximum when $\alpha_j = \alpha_k$. Choice time will therefore (a) depend upon the general level of preference for objects, being quicker for preferred objects, and (b) be quicker the greater the difference in preference for the two paired objects. This is in agreement with experimental findings, e.g., for children choosing among liquids to drink, Barker (1942), and for aesthetic preferences, Dashiell (1937).
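A numerical illustration of Equation 20 for a hypothetical set of five absolute preference parameters (the values are invented for the example and are not taken from Barker or Dashiell). It tabulates the mean choice time for every pair and checks the two qualitative predictions: adjacent pairs become slower further down the ranking, and, for a fixed sum of parameters, a larger difference in preference gives a quicker choice.

```python
from itertools import combinations


def mean_choice_time(a_j, a_k):
    """Equation 20: mean choice time for the pair (X_j, X_k)."""
    s = a_j + a_k
    return 2 / s + 3 * a_j * a_k / (s * (s**2 - a_j * a_k))


# Hypothetical absolute preference parameters alpha_1 > ... > alpha_5 (X_1 most preferred).
PREFERENCE = [5.0, 4.0, 3.0, 2.0, 1.0]

# Mean choice time for every pair, keyed by the ranks (j, k) of the two objects.
times = {
    (j + 1, k + 1): mean_choice_time(PREFERENCE[j], PREFERENCE[k])
    for j, k in combinations(range(len(PREFERENCE)), 2)
}

if __name__ == "__main__":
    for (j, k), t in sorted(times.items(), key=lambda item: item[1]):
        print(f"X_{j} vs X_{k}: {t:.3f}")
```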


It will be interesting to determine 
how far the assumption of an absence 
of contextual effects can be main- 
tained. If the assumption turns out 
to be approximately true, then the 
parameters, a;, would provide a means 
of scaling the stimulus objects for a 
given individual. In essence, such an 
approach would resemble that adopted 
by Bradley and Terry (1952), but 
would have the added advantage that 
the scale values would have an abso- 
lute rather than a relative basis, so 
that the scale values should be un- 
affected by the inclusion of new 
comparisons. 

Number of VTEs for different comparisons. It was shown, in discussing Equation 7, that the mean number of VTEs in a given situation depends entirely upon the ratio of $\alpha$ to $\beta$. In the present notation this is the ratio of $\alpha_j$ to $\alpha_k$, for objects $X_j$ and $X_k$. The number of VTEs has a maximum when $\alpha_j = \alpha_k$, and decreases as the values of the parameters become more disparate. Thus the number of VTEs should depend entirely upon the differences in preference and not upon the general level of preference for the two paired objects. Hence, for adjacent objects $X_i$ and $X_{i+1}$, the number of VTEs before a final choice will not rise with choice time as one proceeds from preferred to nonpreferred objects. This is slightly complicated by differences in "preference distance" between adjacent objects, but the prediction is again found to be in agreement with experimental evidence, e.g., see Barker (1942).
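Equation 7 itself is not reproduced in this excerpt, so the sketch below rests on an explicit assumption about the K = 2 case. Implicit responses toward $X_j$ and $X_k$ are taken to occur in the proportion $\alpha_j : \alpha_k$. An overt choice is made as soon as two successive implicit responses are of the same kind, and every implicit response before that decisive pair counts as a VTE. Under that assumption the expected count works out to $3\alpha_j\alpha_k/(\alpha_j^2 + \alpha_j\alpha_k + \alpha_k^2)$. This is our derivation and may be written differently from Equation 7. The simulation checks the formula and shows that only the ratio of the parameters enters.

```python
import random


def mean_vtes(a_j: float, a_k: float) -> float:
    """Expected VTEs under the assumed K = 2 rule; depends only on a_j / a_k."""
    return 3.0 * a_j * a_k / (a_j * a_j + a_j * a_k + a_k * a_k)


def simulate_vtes(a_j: float, a_k: float, trials: int = 50_000, seed: int = 1) -> float:
    """Monte Carlo estimate of the mean VTE count under the same assumed rule."""
    rng = random.Random(seed)
    p_j = a_j / (a_j + a_k)          # chance that an implicit response favours X_j
    total = 0
    for _ in range(trials):
        previous, count = None, 0
        while True:
            current = "j" if rng.random() < p_j else "k"
            count += 1
            if current == previous:  # two of a kind in a row: overt choice
                break
            previous = current
        total += count - 2           # implicit responses before the decisive pair
    return total / trials


# Same ratio, different general level of preference: identical expected VTEs.
print(mean_vtes(1.0, 3.0), mean_vtes(5.0, 15.0))
# Equal preference gives the maximum (1.0); disparate preference gives fewer VTEs.
print(mean_vtes(2.0, 2.0), mean_vtes(1.0, 3.0), simulate_vtes(1.0, 3.0))
```

Timing never enters the count, because the rates affect it only through `p_j`. That is why scaling both parameters leaves the expected number of VTEs unchanged.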

Learning in choice situations. It is in considering learning behavior that the need for individual results is greatest (Audley & Jonckheere, 1956). The full advantages of the present approach to response variables can only be gained by incorporating the assumption in a stochastic model for learning. The way in which this might be contrived, when K = 1, has already been outlined and illustrated elsewhere (Audley, 1957, 1958). On the whole, therefore, the experimental literature does not provide results in a way which enables the predictions of the model to be falsified, even at a qualitative level. The most that can be done here is to show that the predictions might well be good approximations to the properties of learning data.

Given a particular theory of learn- 
ing it would, of course, be possible to 
anchor the theory more closely to re- 
sponse variables by identifying the 
parameter of the choice model with an 
appropriate theoretical construction. 

The properties of the model and simple learning behavior. Consider, for example, learning in a simple two-choice situation. Let $\alpha$ be associated with A, the correct response, and $\beta$ with B, the incorrect response. The way in which $\alpha$ and $\beta$ vary with reward and punishment is naturally a matter for investigation and would certainly condition the form of the prediction which would be made. Nevertheless, it is not unreasonable to assume that $\alpha$ will be some monotonic increasing function, and $\beta$ some monotonic decreasing function, of practice and of punishments and rewards.

Let it be supposed that the S has at first a strong tendency to produce the incorrect choice, i.e., $\alpha$ is small relative to $\beta$. Consider, firstly, what might be expected to happen to the over-all latency $L$ and the latencies of A and B, $L_A$ and $L_B$ respectively. In discussing Equations 10a and 10b it was shown that the dominant response, on the average, will have the shorter choice time. Thus in the first place it will be expected that $L_A$ will be greater than $L_B$ until the probability of making the correct choice, $P_A$, reaches and exceeds 0.5, when $L_A$ will be generally shorter than $L_B$.

All of the latencies are dependent upon two factors, the sum $(\alpha + \beta)$ and the ratio of $\alpha$ to $\beta$. The over-all latency, $L$, if $(\alpha + \beta)$ remains constant, will rise to a maximum when $P_A = P_B = 0.5$ (i.e., $\alpha = \beta$) and then fall again. Superimposed upon this rise and fall will be the influence of $(\alpha + \beta)$. If the levels of, say, punishment and reward are such as to disturb the constancy of this quantity, then there will be an accentuation or flattening of the curve of latency as a function of practice. The monotonic decline in response latencies observed when an S is introduced into a learning situation for the first time does not counter this prediction. For, then, it is to be expected that $(\alpha + \beta)$ will be initially small, and the effect of increasing $\alpha$, and, hence, $(\alpha + \beta)$, will be reinforced by the growing difference in magnitude between $\alpha$ and $\beta$. In original learning, therefore, the two factors work together and produce the monotonic decrease in latency.
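To make the latency argument concrete, the sketch below evaluates Equation 20, with $\alpha$ and $\beta$ in place of $\alpha_j$ and $\alpha_k$, along two made-up practice schedules. The linear schedules are purely illustrative assumptions, since the text deliberately leaves open how $\alpha$ and $\beta$ depend on practice. In the first schedule, $\alpha$ rises and $\beta$ falls while $(\alpha + \beta)$ is held at 4. In the second, a rough stand-in for original learning, $\beta$ stays at 1 while $\alpha$ grows from 1.

```python
def mean_choice_time(alpha: float, beta: float) -> float:
    """Equation 20 with alpha, beta in place of alpha_j, alpha_k."""
    s, p = alpha + beta, alpha * beta
    return 2.0 / s + 3.0 * p / (s * (s * s - p))


trials = range(11)
# Hypothetical schedule 1: alpha + beta fixed at 4; alpha overtakes beta at t = 5.
constant_sum = [mean_choice_time(0.5 + 0.3 * t, 3.5 - 0.3 * t) for t in trials]
# Hypothetical schedule 2: beta fixed at 1, so alpha and the sum grow together.
growing_sum = [mean_choice_time(1.0 + 0.5 * t, 1.0) for t in trials]

peak = max(trials, key=lambda t: constant_sum[t])
print("constant sum:", [round(x, 3) for x in constant_sum], "peak at t =", peak)
print("growing sum: ", [round(x, 3) for x in growing_sum])
```

With the sum held constant, latency rises and falls, peaking at the trial where $\alpha = \beta$. When the sum grows while $\alpha$ pulls away from $\beta$, the two factors act together and latency falls on every trial.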

The number of VTEs, from Equation 7, is seen to be a function only of the ratio of $\alpha$ to $\beta$. Thus VTEs would be expected to rise to a maximum when $\alpha = \beta$, i.e., $P_A = P_B = 0.5$, and then decline.

These predictions are probably only 
applicable to the very simple two- 
choice situations so far considered. 
For discrimination studies, the prob- 
lem is complicated by the way in 
which the relevant cues are being 
utilized by the organism and there is 
no point in reviewing the controversy 
over this matter. It does however 
seem worthwhile pointing out that, in 
discrimination behavior, it is very 
probable that there appears something 
like the problem of the use of the third 
category in psychophysical procedures. That is, a distinction seems to be necessary between, on the one hand, a definite act of choice and, on the other hand, behavior which occurs simply because something has to be done in the situation. This speculative point is raised because the size of the parameters may exert an influence upon behavior in two ways: firstly, by determining the probability of making a particular response when a “true” choice is made and, secondly, by determining the probability that a “true” choice is made.

Henmon’s experiment. The experiment conducted by Henmon (1911) is
of particular interest, because it pro- 
vides data from individual Ss, in a 
situation where stimulus conditions 
can be assumed to be fairly constant 
from trial to trial. The observations, 
therefore, are important for any model 
concerned with the properties of 
choice behavior. 

Henmon required Ss, in each of 1,000 trials, to decide whether one of two horizontal lines was longer or shorter than the other. The lengths of the lines were always 20 mm and 20.3 mm respectively. In addition, Ss were instructed to indicate their confidence in each judgment.

The model is qualitatively in agree- 
ment with Henmon’s data, except in 
two things. Firstly, although aver- 
age choice time for wrong responses is 
larger than that for correct choices, as 
predicted by the model, the wrong 
responses are relatively quicker in 
each category of confidence. The 
second qualitative difference appears 
in examining accuracy as a function 
of time. There is some indication for 
some Ss that although there is a 
general decline in accuracy with longer 
choice times, again predicted by the 
model, there is also a slight rise in 
accuracy in going from very short to 
moderately short choice times. It is possible that both of these differences
may be accounted for by a suitable 
analysis of judgments of confidence 
about which only a few speculations 
have been advanced in the present 
paper. The important point, it seems 
to the author, is that the general 
stochastic model is capable of dealing 
with this kind of issue, rather than 
that it succeeds in all details at the 
present time. 

Henmon gives the distribution of 
all choice times for each individual. 
Since this can also be derived from 
the model, a comparison of the two 
distributions should give further indi- 
cations as to the adequacy of the 
present approach to choice behavior. 
In testing the goodness of fit of the 
model in this matter, it would be 
usual to estimate the parameters from 
the distribution of choice times alone. 
However, it was decided that perhaps 
a stronger case could be made out if 
the only time datum used to estimate 
the parameters was the mean latency. 
Two equations are of course required if values of $\alpha$ and $\beta$ are to be determined, and $P_A$, the probability of a correct response, was chosen for the second. Accordingly the present estimates are based upon Equations 6a and 11.
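Equation 6a is not reproduced in this excerpt. The estimation sketch below therefore assumes the K = 2 form $P_A = \alpha^2(\alpha + 2\beta) / [(\alpha + \beta)(\alpha^2 + \alpha\beta + \beta^2)]$. This is our own derivation under a "two of a kind in a row" stopping rule and should be treated as an assumption. The sketch pairs it with Equation 20 for the mean decision time, after the minimum response time discussed below has been subtracted. Because $P_A$ depends only on $r = \alpha/\beta$, and the mean time scales as $1/\beta$ for fixed $r$, the two equations can be solved one at a time: first bisect for $r$, then rescale. All function names are hypothetical.

```python
import math


def p_correct(alpha: float, beta: float) -> float:
    """Assumed K = 2 form of Equation 6a: probability of choosing A."""
    s2 = alpha * alpha + alpha * beta + beta * beta
    return alpha * alpha * (alpha + 2.0 * beta) / ((alpha + beta) * s2)


def mean_latency(alpha: float, beta: float) -> float:
    """Equation 20 with alpha, beta in place of alpha_j, alpha_k (decision time only)."""
    s, p = alpha + beta, alpha * beta
    return 2.0 / s + 3.0 * p / (s * (s * s - p))


def estimate_parameters(p_a: float, mean_time: float, minimum_time: float = 0.0):
    """Solve for (alpha, beta) from P_A and the observed mean choice time."""
    if not 0.0 < p_a < 1.0:
        raise ValueError("p_a must lie strictly between 0 and 1")
    decision_time = mean_time - minimum_time
    if decision_time <= 0.0:
        raise ValueError("mean time must exceed the minimum time")
    # P_A depends only on r = alpha / beta and increases with r: bisect on log r.
    lo, hi = -20.0, 20.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if p_correct(math.exp(mid), 1.0) < p_a:
            lo = mid
        else:
            hi = mid
    r = math.exp(0.5 * (lo + hi))
    # For fixed r the mean latency scales as 1 / beta: L(r*beta, beta) = L(r, 1) / beta.
    beta = mean_latency(r, 1.0) / decision_time
    return r * beta, beta


# Round trip with the values reported for S Bl (alpha = 3.19, beta = 1.28, minimum 0.40 sec).
p_a = p_correct(3.19, 1.28)
mean_time = 0.40 + mean_latency(3.19, 1.28)
print(estimate_parameters(p_a, mean_time, minimum_time=0.40))
```

The round trip recovers the parameters it started from. The bisection is safe because the assumed $P_A$ increases strictly with $r$.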

There must, of course, be some 
minimum response time before which 
no response can occur. This is not 
easy to determine from Henmon’s 
tables of results, because the data are 
already grouped in intervals of 200 
milliseconds. For this reason, the 
minimum possible time was estimated 
in the following way. For various 
assumed minimum times, estimates of $\alpha$ and $\beta$ were determined, and the
theoretical distribution of choice times 
computed. The value leading to the 
best fit was then adopted. This is not entirely a satisfactory procedure, but with K assumed to be 2, and with no direct indication of the minimum time, it seemed the best available in the circumstances. The results for Henmon's (1911, Table 2, p. 194) Ss Bl and Br are considered below. For Bl, the minimum possible time was taken to be about 0.40 sec. On this basis $\alpha = 3.19$ and $\beta = 1.28$, these values referring to a time scale measured in seconds. For Br, the minimum time was taken to be 0.34 sec., giving $\alpha = 6.68$ and $\beta = 4.28$. A comparison of the observed and expected distributions of response times is given in Table 1. The agreement between model and data seems to be reasonably good.

TABLE 1

| Time interval (msec) | Bl observed | Bl expected | Br observed | Br expected |
|---|---|---|---|---|
| 100-299 | (2)* | | ... | ... |
| 300-499 | 57 | 53 | 350 | 352 |
| 500-699 | 214 | 229 | 381 | ... |
| 700-899 | 220 | 229 | 170 | 165 |
| 900-1099 | 159 | 168 | 65 | 57 |
| 1100-1299 | ... | 113 | 26 | 19 |
| 1300-1499 | 85 | 83 | 5 | ... |
| Above 1500 | ... | ... | ... | ... |

* These observations ignored in calculations.

CONCLUDING REMARKS

On the whole, there is a certain looseness in the way in which many contemporary theories and even local hypotheses are linked to observed response variables. It seems worthwhile, therefore, to try to determine whether these variables might not be related to one another by relatively simple laws which operate in most choice situations. In this way, not only are descriptions of choice behavior considerably simplified, but better ways of formulating and testing theories are suggested. The model itself is naturally also a theory about a certain aspect of behavior, and as such needs to be tested.

In this presentation of the general stochastic model the intention is to indicate the potentialities of the approach, rather than to make specific tests of the case arising when K = 2. It is not to be expected that the two simple assumptions will alone account for the relations existing between response variables in a wide diversity of situations. Each situation will undoubtedly have certain unique conditions which have to be taken into account. But the dete[...] share certain important properties with choice behavior and therefore it appears to be a reasonable initial working hypothesis. It can be tested in great detail against data, and the parameters are of a kind which could be identified with either psychological or physiological constructs.

Methods of estimating parameters 
and statistical tests of goodness of fit 
will be discussed elsewhere. For the 
present model, neither of these procedures involves any novel problems.
For example, given the probability of 
occurrence of one of the alternative 
responses and the over-all mean re- 
sponse time, Equations 6 and 11 may 
be easily solved to give the appro- 
priate parameter values. 
The concept of stimulus distinctiveness occurs in a number of different
areas of psychology. Miller and Dol- 
lard (1941) have emphasized the 
relationship between the cue prop- 
erties of a stimulus and its distinc- 
tiveness; also, these same authors 
(Dollard & Miller, 1950) have sug- 
gested that the distinctiveness of a 
stimulus may be increased by attach- 
ing a verbal label to it (‘‘acquired 
distinctiveness of cues’’). Distinc- 
tiveness may play a role in such 
diverse phenomena as the von Restorff 
effect in verbal learning (von Restorff, 
1933), the “shock-right” effect in 
maze learning (Muenzinger, 1934), 
selectivity in memory (Bartlett, 1932), 
and the effects of grouping on perception (Köhler, 1929). Studies using
the method of absolute judgment are 
investigating stimulus distinctiveness, 
and several scaling methods have been 
proposed (Garner & Hake, 1951). 
Also, it would seem reasonable to 
assume that stimulus distinctiveness 
would be a critical variable in many 
studies of discrimination learning. 

One difficulty with the concept of 
distinctiveness is that no generally 
accepted method of measurement has 
been developed. It is all very well to 
say that attaching a label to an un- 
familiar stimulus increases its distinc- 
tiveness, but how distinctive was it 
initially and how much has the label 
increased the distinctiveness? Put- 
ting one flower pot out of line or 
typing the seventh word in a serial 
list in red may, in both cases, increase 
the distinctiveness of the stimulus, 
but again how much? Of course, it 
is always possible to determine em- 


pirically the distinctiveness of stimuli 
by using an identification experiment 
of some sort (e.g., Austin & Sleight, 
1952), but then it is no longer possible 
to determine the effects of distinctive- 
ness on accuracy of identification 
because there is no independent meas- 
ure of distinctiveness. It would seem 
desirable to develop a quantitative 
measure of distinctiveness that could 
be derived without recourse to pre- 
liminary experimentation so that the 
general effects of this variable could 
be determined. 

In this paper we would like to 
suggest a simple and obvious way 
to quantify stimulus distinctiveness. 
The method was developed by making 
a few assumptions about the processes 
involved and then stating these as- 
sumptions in numerical terms. How- 
ever, at the outset it should be made 
clear that the proposed method is 
only applicable to a limited class of 
stimuli; specifically, stimuli which 
vary in one dimension. Further, at 
first we shall restrict the stimuli to 
those which vary in magnitude or 
intensity; thus, such stimuli as tones 
or lights of varying intensity, weights 
of varying magnitude, or time in- 
tervals of varying duration. While 
this restriction certainly limits the 
generality of the method, it is often 
advantageous to attempt to under- 
stand a phenomenon at a simple level 
before studying it at a more complex 
level. Also, it sometimes happens 
that an explanation designed to apply 
only under rather restricted condi- 
tions turns out to have somewhat 
wider applicability. 

Essentially, the distinctiveness of a 
given stimulus is the extent to which it “stands out” from other stimuli.
With a finite group of stimuli which 
vary only in magnitude, it is assumed 
that the distinctiveness of any given 
stimulus is a function of the difference 
between it and all other stimuli in the 
group. Numerically, the distinctive- 
ness of any given stimulus would be 
the sum of the differences between it 
and all other stimuli in the group. 
Thus, with four stimuli whose magni- 
tudes were 2, 4, 6, and 8 units, the 
difference between the first stimulus 
and each of the other three would be 
2, 4, and 6, respectively; the total
distinctiveness would be the sum of 
these, or 12. Similarly, the distinc- 
tiveness of the second, third, and 
fourth stimuli would be 8, 8, and 12, 
respectively. The total for the four 
stimuli would be 40, so the percentages 
of distinctiveness of the four stimuli 
would be 30%, 20%, 20%, and 30%, 
respectively. This, then, is basically 


the suggested procedure for quantify- 


ing the distinctiveness of stimuli. 

In any specific situation the magni- 
tude of each stimulus would be the 
magnitude of its physical energy. 
Also, in accordance with the Weber- 
Fechner Law all energy values should 
be transformed into log energy values 
before the computation is started. 
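Thus, in the example already given, the four stimuli would have physical energies of 100, 10,000, 1,000,000, and 100,000,000 units with corresponding log values of 2, 4, 6, and 8 log units.¹

TABLE 1
ILLUSTRATIVE MATRIX TO OBTAIN TD AND D%

| Log energy | 2 (j = 1) | 4 (j = 2) | 6 (j = 3) | 8 (j = 4) | TD | D% |
|---|---|---|---|---|---|---|
| 2 (i = 1) | -3×2 | +4 | +6 | +8 | 12 | 30% |
| 4 (i = 2) | -2 | -1×4 | +6 | +8 | 8 | 20% |
| 6 (i = 3) | -2 | -4 | +1×6 | +8 | 8 | 20% |
| 8 (i = 4) | -2 | -4 | -6 | +3×8 | 12 | 30% |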
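The basic procedure, including the log transformation, can be sketched in a few lines of Python. `percent_distinctiveness` is a hypothetical helper name. Base-10 logarithms are used only because the example energies are powers of ten: any base gives the same D%, since changing the base multiplies every difference by the same constant.

```python
import math


def percent_distinctiveness(values):
    """Return (TD, D%) for a list of stimulus values (already log-transformed).

    TD for a stimulus is the sum of its absolute differences from every other
    stimulus; D% expresses each TD as a percentage of the total of all TDs.
    """
    if len(values) < 2:
        raise ValueError("distinctiveness needs at least two stimuli")
    td = [sum(abs(x - y) for y in values) for x in values]
    total = sum(td)
    if total == 0:
        raise ValueError("all stimuli have the same value")
    return td, [100.0 * t / total for t in td]


energies = [100, 10_000, 1_000_000, 100_000_000]
log_energies = [math.log10(e) for e in energies]   # Weber-Fechner step: 2, 4, 6, 8
td, d_pct = percent_distinctiveness(log_energies)
print(td)      # total distinctiveness (TD) of each stimulus
print(d_pct)   # percentage distinctiveness (D%)
```

This reproduces the figures worked out in the text: TD values of 12, 8, 8 and 12, and D% values of 30%, 20%, 20% and 30%.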

For computational purposes it is 
unnecessary (and, with any appreci- 
able number of stimuli, unduly 
tedious) to calculate all interstimulus 
differences in order to obtain the total 
distinctiveness (TD) or percentage of 
distinctiveness (D%) of each stimulus. 
As a simpler method it can be shown that, as illustrated in Table 1, the TD of each stimulus is the sum of its row in an n × n square matrix, where n is the number of stimuli (in the present example n = 4). Following conventional matrix notation, the rows are indexed by $i$ and the columns by $j$, both running over the log energies $a_1, a_2, a_3, \ldots, a_n$ arranged in increasing order. All entries below the principal diagonal of the matrix (upper left to lower right) are obtained by multiplying the log energy values of the column by -1, and all entries above the principal diagonal are obtained by multiplying the log energy values of the column by +1. The principal diagonal itself is obtained by multiplying the log energy value of the column (or row) by the term $-[n - (2i - 1)]$. This term always forms an arithmetic progression in steps of 2 from $-(n - 1)$ to $(n - 1)$. To obtain TD for any particular group of stimuli, then, all that is necessary is to write the matrix and sum the rows; D% is obtained merely by expressing each TD as a percentage of the sum of all the TDs.
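The matrix rule can be checked mechanically. The sketch below builds the matrix with 1-based indices $i$ and $j$ as in the text, sums the rows to obtain TD, and compares the result with the direct sum of absolute differences. The function name is hypothetical, and the log energies are assumed to be supplied in increasing order, as the rule requires.

```python
def distinctiveness_matrix(a):
    """Build the n x n matrix of the text for log energies a_1 <= ... <= a_n."""
    if list(a) != sorted(a):
        raise ValueError("log energies must be in increasing order")
    n = len(a)
    matrix = []
    for i in range(1, n + 1):
        row = []
        for j in range(1, n + 1):
            if j < i:      # below the principal diagonal: column value times -1
                row.append(-a[j - 1])
            elif j > i:    # above the principal diagonal: column value times +1
                row.append(a[j - 1])
            else:          # principal diagonal: value times -[n - (2i - 1)]
                row.append(-(n - (2 * i - 1)) * a[i - 1])
        matrix.append(row)
    return matrix


a = [2, 4, 6, 8]
m = distinctiveness_matrix(a)
td = [sum(row) for row in m]
direct = [sum(abs(x - y) for y in a) for x in a]
print(m[0], td, td == direct)
```

Each row sum equals the direct TD because, for ordered values, stimulus $i$ lies above $i - 1$ others and below $n - i$ others. Its own value therefore enters the total $(i - 1) - (n - i) = -[n - (2i - 1)]$ times, which is exactly the diagonal multiplier.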

An even simpler computational procedure is illustrated in Table 2. With this method it becomes possible, knowing the log energy values, to obtain the necessary sums without actually writing out the matrix. In Table 2 the first column after the log energy values gives the sum of all entries below the principal diagonal, the third column the sum of all entries above the principal diagonal, and the second column gives the principal diagonal itself. Although the matrix notation may appear formidable, in practice the entries in the table are extremely easy to obtain. The first and the third columns can be found by a cumulative summation of the log energy values, summing down for the first column and summing up for the third column. The second column is found by multiplying the log energy value of that row by the appropriate term in the arithmetic progression mentioned in the previous paragraph. Only these three columns are needed irrespective of the number of stimuli, and as before TD is found by summing each row. As an arithmetical check, the sum of TD should always be twice the sum of the second column.

TABLE 2
ALTERNATIVE METHOD TO COMPUTE TD

| $a_i$ | $-\sum_{j<i} a_j$ | $-[n-(2i-1)]\,a_i$ | $+\sum_{j>i} a_j$ | TD |
|---|---|---|---|---|
| 2 | 0 | -[3]2 = -6 | +18 | 12 |
| 4 | -2 | -[1]4 = -4 | +14 | 8 |
| 6 | -6 | -[-1]6 = +6 | +8 | 8 |
| 8 | -12 | -[-3]8 = +24 | 0 | 12 |
| Total | | | | 40 |

¹ The measure of distinctiveness suggested here is in a way quite similar to Helson's adaptation-level theory (Helson, 1947). Both theories assume a log function, and both theories emphasize the effect that all the stimuli in a group have upon one particular stimulus.
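The Table 2 shortcut needs only running sums, which makes it easy to program. In the sketch below (function name hypothetical; log energies again assumed to be in increasing order), `itertools.accumulate` supplies the downward and upward cumulative sums, and the arithmetical check from the text is asserted. The check always holds. Together, the first and third columns contribute $\sum_i (2i - n - 1)a_i$, which is exactly the sum of the second column.

```python
import math
from itertools import accumulate


def td_by_cumulative_sums(a):
    """Return the three columns of Table 2 and TD for log energies a_1 <= ... <= a_n."""
    n = len(a)
    prefix = list(accumulate(a))                   # summing down
    suffix = list(accumulate(reversed(a)))[::-1]   # summing up
    below = [0] + [-s for s in prefix[:-1]]        # -(sum of a_j for j < i)
    diagonal = [-(n - (2 * i - 1)) * a[i - 1] for i in range(1, n + 1)]
    above = suffix[1:] + [0]                       # +(sum of a_j for j > i)
    td = [b + d + u for b, d, u in zip(below, diagonal, above)]
    # Arithmetical check from the text: the TDs sum to twice the diagonal column.
    assert math.isclose(sum(td), 2 * sum(diagonal))
    return below, diagonal, above, td


for row in zip([2, 4, 6, 8], *td_by_cumulative_sums([2, 4, 6, 8])):
    print(row)   # (a_i, first column, second column, third column, TD)
```

The printed rows match Table 2: (2, 0, -6, 18, 12), (4, -2, -4, 14, 8), (6, -6, 6, 8, 8) and (8, -12, 24, 0, 12).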


VALIDITY 


The measure of distinctiveness to be 
used is percent distinctiveness, or D%. 
As a measure of distinctiveness, D% can be validated by determining the extent to which it can predict accuracy of identification in the method of absolute judgment. With the method of absolute judgment it seems reasonable to assume that performance is primarily a function of stimulus distinctiveness; the more distinctive stimuli will be correctly identified relatively often and the less distinctive stimuli less often. Therefore, if D% is valid it should predict performance in absolute judgment. So if, for a given stimulus, D% = 13%, then 13% of all the correct identifications should occur to that particular stimulus.

We would like to present the results 
of six experiments that test the 
validity of the proposed measure of 
distinctiveness. The first three experiments were conducted specifically
for this purpose, and the second three 
experiments were taken from the 
literature. 


Procedure 


In each of the three experiments to be reported the stimuli were 1,000 cycle tones of varying intensities. Group testing was used, and in each experiment there were nine different intensities spaced over a 40 db range.² Each intensity was to be identified by a number from 1-9, and numbers were assigned to stimuli in increasing order of magnitude. Each experiment used the method of absolute judgment with knowledge of results. Two seconds after a verbal "Ready" signal the tone was presented for a two-second period. Then, the Ss had five seconds in which to write down their judgment (i.e., the number of the stimulus). At the end of this five-second period E called out the correct identification of the stimulus, and three seconds after this identification the next "Ready" signal was given.

In each experiment the nine tones were first presented, once in ascending and then once in descending order of magnitude. Following these two runs the testing began. Each of the nine tones was presented 10 times in random order subject to the following limitations (which were not reported to Ss): (a) the same tone could not occur twice in succession, and (b) each tone had to occur twice in every 18 presentations. A rest of 30 sec. was given after 30 and after 60 presentations, and the entire procedure for each experiment required approximately 20 minutes. The Ss were students of both sexes from the author's Introductory Psychology class. The same students were tested for all three experiments, and the three experiments were conducted on separate days. The same classroom was used for all three experiments, and there were 27, 25, and 26 Ss, respectively.

The stimuli were produced by a Heathkit Audio Generator (Model AG-9) terminated by an (internal) 600 ohm load; the two highest positions on the attenuator switch were not used. The signal from the Audio Generator was fed into the low-gain input of a Heathkit 7-watt amplifier (Model A-7E) and then into a Heathkit 12” loudspeaker (Model 401-6). At all times the output from the amplifier was monitored on an oscilloscope to insure that no observable distortion was occurring.

² As group testing was used the intensity level at the eardrum differed from S to S; therefore, the db scale is only a relative scale. To give an average value, the intensities used were approximately in the range of 40 to 80 db above threshold.

TABLE 3
PREDICTED AND OBTAINED RESULTS FOR THREE DIFFERENT EXPERIMENTS ON LOUDNESS

| Exp. I intensity | Pred. | Obt. | Exp. II intensity | Pred. | Obt. | Exp. III intensity | Pred. | Obt. |
|---|---|---|---|---|---|---|---|---|
| 0 db | ... | 15.4% ± 1.8% | 0 db | 24.6% | 19.6% ± 1.2% | ... | 12.6% | 12.8% ± 2.1% |
| 5 db | ... | 9.5% ± 2.3% | 17 db | 12.8% | 13.1% ± 1.7% | ... | 11.7% | 12.8% ± 1.7% |
| 10 db | ... | 10.1% ± 1.5% | 24 db | 9.4% | 11.7% ± 1.6% | ... | 10.9% | 9.2% ± ... |
| 15 db | ... | 8.2% ± 1.3% | 28 db | 8.2% | 8.0% ± 1.3% | ... | 10.1% | 7.6% ± 2.1% |
| 20 db | ... | 7.8% ± 1.5% | 32 db | 7.8% | 8.5% ± 1.7% | ... | 9.3% | 12.1% ± 2.2% |
| 25 db | ... | 7.6% ± 1.8% | 34 db | 8.0% | 6.8% ± 1.6% | ... | 10.1% | 10.6% ± 1.8% |
| 30 db | ... | 11.4% ± 1.6% | 36 db | 8.6% | 8.8% ± 1.6% | ... | 10.9% | 10.0% ± 2.1% |
| 35 db | ... | 13.1% ± 1.5% | 38 db | 9.6% | 9.8% ± 1.7% | ... | 11.7% | 11.9% ± 2.0% |
| 40 db | ... | 16.6% ± 1.4% | 40 db | 11.0% | 13.6% ± 1.8% | ... | 12.6% | 12.8% ± 1.7% |


Results 


The main results are shown in Table 3, which gives, for each experiment, the spacing of the tones, the predicted percentage correct (i.e., D%), and the obtained percentage correct. As can be seen, the differences between predicted and obtained values were in general rather small. For an over-all measure of agreement, the "standard error of estimate" was 1.3%, 2.1%, and 1.5% for the three experiments, respectively. That is, in each experiment two-thirds of the predicted values were correct within the percentage given. The only difference between the present usage of the standard error of estimate and the more conventional usage is that here the predicted values were obtained before the experiments were conducted; in the usual application of the standard error of estimate the predicted values are obtained from the regression line after the data have been collected.
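A small sketch of both steps for Exp. II, whose tone levels are legible in Table 3. D% is computed directly from the db values; db is already a logarithmic energy scale, so no further transformation is applied. The standard error of estimate is taken here to be the root-mean-square difference between obtained and predicted percentages, dividing by the number of stimuli. That divisor is our assumption, since the text does not say which one was used, and the helper names are hypothetical.

```python
import math


def percent_distinctiveness(levels):
    """D% for each stimulus from log-scale levels (here db)."""
    td = [sum(abs(x - y) for y in levels) for x in levels]
    total = sum(td)
    return [100.0 * t / total for t in td]


def standard_error_of_estimate(predicted, obtained):
    """Root-mean-square of (obtained - predicted); the divisor n is an assumption."""
    if len(predicted) != len(obtained) or not predicted:
        raise ValueError("need two equal-length, non-empty lists")
    squared = [(o - p) ** 2 for p, o in zip(predicted, obtained)]
    return math.sqrt(sum(squared) / len(squared))


exp2_db = [0, 17, 24, 28, 32, 34, 36, 38, 40]
exp2_obtained = [19.6, 13.1, 11.7, 8.0, 8.5, 6.8, 8.8, 9.8, 13.6]
predicted = percent_distinctiveness(exp2_db)
print([round(p, 1) for p in predicted])
print(round(standard_error_of_estimate(predicted, exp2_obtained), 2))
```

The rounded predictions reproduce the Exp. II predicted column of Table 3. The root-mean-square deviation comes to about 2.1%, matching the figure reported for the second experiment.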

One indication of the reliability of 
the data is shown in Table 3 where the 
99% confidence interval is indicated 
for each of the nine stimuli in each of 
the three experiments. For all three 
experiments the median 99% confi- 
dence interval was 1.7%, and the 
largest was only 2.3%. A reliability 
coefficient was also determined for 
each experiment. The Ss were arranged alphabetically and the mean
percentage correct for each stimulus 
in the odd-numbered group was com- 
pared with the mean percentage cor- 
rect for each stimulus in the even- 
numbered group. The reliability 
coefficients were .95, .95, and .65 for 
the three experiments, respectively. 
Related Experiments


We have also found three experi- 
ments in the literature whose results 
are presented so that they can be 
analyzed in approximately the same 
way that the above results were 
analyzed. Although the sample is 
probably not exhaustive it is hoped 
that it is at least representative. In 
each experiment the method of analy- 
sis was as follows: First, D% was 
calculated from the data on stimulus 
magnitudes presented by the author. 
In one of the three experiments the 
number of correct responses to each 
stimulus was reported so the predicted 
and obtained results could be directly 
compared. In the other two cases 
the data was reported in terms of re- 
sponse uncertainty. Although this is 
not the same as correct responses, we 
have found that with some of our own 
data the correlation between the two 
measures is very high (-.98, -.98, and -.91 respectively for the three
experiments just reported). Also, it 
seems reasonable that as distinctive- 
ness decreases uncertainty should in- 
crease. However, we cannot yet 
predict uncertainty from D%, so in 
these two cases all we can do is to 
report the value of the correlation 
coefficient between D% and uncertainty.

Garner (1953) used the method of 
absolute judgment to study the accu- 
racy of identification of 20 tones of 
different intensity ranging from 15 to 
110 db in 5 db steps. In the upper 
graph of Fig. 8 (p. 378) he presented
the response uncertainty in bits for the 
20 stimulus categories. The values 
were read from this graph, and the 
correlation between D% and uncertainty was -.92.

Eriksen and Hake (1957) used the method of absolute judgment to study the accuracy of identification of 20 squares of different areas ranging from 2 to 40 mm. on a side in 2 mm. steps. Their Fig. 2 (p. 137) shows what percentage of the total presentations of each square were correct for two different groups. For each stimulus the average of the two groups was determined from Fig. 2, and from these data was calculated the percentage of the total number of correct identifications that occurred to each stimulus.³ The predicted and obtained results are shown in Table 4, and the standard error of estimate is 1.0%.

TABLE 4
PREDICTED AND OBTAINED RESULTS FROM ERIKSEN & HAKE (1957)

| Area (mm) | Pred. | Obt. |
|---|---|---|
| 2 | 12.4% | 11% |
| 4 | 8.8% | 10% |
| 6 | 6.8% | 9% |
| 8 | 5.7% | 6% |
| 10 | 4.9% | 5% |
| 12 | 4.4% | 5% |
| 14 | 4.0% | 4% |
| 16 | 3.8% | 3% |
| 18 | 3.6% | 3% |
| 20 | 3.6% | 3% |
| 22 | 3.6% | 3% |
| 24 | 3.6% | 3% |
| 26 | 3.7% | 3% |
| 28 | 3.8% | 3% |
| 30 | 4.0% | 3% |
| 32 | 4.2% | 3% |
| 34 | 4.4% | 3% |
| 36 | 4.7% | 4% |
| 38 | 4.9% | 5% |
| 40 | 5.1% | 7% |
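The Table 4 predictions can be regenerated the same way. The sketch below assumes that D% was computed from the logarithms of the side lengths (2 to 40 mm in 2 mm steps). Taking logs of the areas instead would give identical percentages, since log area is just twice log side. Under that assumption the computed values agree with the Pred. column to within 0.1 percentage point. The root-mean-square deviation of the obtained column from the published predictions, dividing by n (also our assumption), is about 1.0%.

```python
import math


def percent_distinctiveness(values):
    """D% for each stimulus from log-transformed values."""
    td = [sum(abs(x - y) for y in values) for x in values]
    total = sum(td)
    return [100.0 * t / total for t in td]


sides_mm = list(range(2, 41, 2))
# Assumption: the log of the side length is the relevant magnitude.
computed = percent_distinctiveness([math.log10(s) for s in sides_mm])

table4_pred = [12.4, 8.8, 6.8, 5.7, 4.9, 4.4, 4.0, 3.8, 3.6, 3.6,
               3.6, 3.6, 3.7, 3.8, 4.0, 4.2, 4.4, 4.7, 4.9, 5.1]
table4_obt = [11, 10, 9, 6, 5, 5, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 5, 7]

largest_gap = max(abs(c - p) for c, p in zip(computed, table4_pred))
rms = math.sqrt(sum((o - p) ** 2 for p, o in zip(table4_pred, table4_obt)) / len(table4_obt))
print(round(largest_gap, 2), round(rms, 2))
```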

Alluisi and Sidorsky (1958) used the method of absolute judgment to study the effects of knowledge of results and number and spacing of stimuli on the accuracy of identification of circles of light of varying diameter. Their results were read from their Fig. 3 (p. 90). The correlations between response equivocation (uncertainty) and D% were -.94 for Part I (15 stimuli and knowledge of results), -.91 for Part II (same stimuli but no knowledge of results), -.75 for Part III (8 stimuli with no knowledge of results), and -.73 for Part IV (same as Part III except that stimuli were spaced more closely together).

³ The procedure of obtaining the total number of correct identifications to each stimulus and then obtaining percentage correct is not mathematically identical to getting percentage correct for each S and then averaging across Ss. Furthermore, it has the disadvantage of giving no indication of variability. However, we have found that in practice the two results are often almost indistinguishable, and the former method is necessary when the data of individual Ss are not available.

The results of these six studies 
provide fairly convincing evidence 
that D% is a valid measure of dis- 
tinctiveness when the criterion is 
the accuracy of identification in the 
method of absolute judgment. Not 
only are the correlations between D% 
and an uncertainty measure quite 
high, but also D% can predict to 


within a few percentage points the 
actual percentage of correct identifica- 
tions that occur to each stimulus. In 
view of these findings, a further examination of the distinctiveness measure would appear to be in order.


THE D SCALE


What we have called D% is a scale of distinctiveness; that is, a set of numbers which results from measuring the stimulus intensities and then following certain arithmetical procedures. The D scale has no units, as the numerical values are expressed in percentages. The scale values can range from a minimum of 0% to a maximum of 50%. The value of 0% is the limit that D approaches as n, the number of stimuli, approaches infinity. With only one stimulus no value of D can be determined. The concept of distinctiveness refers to the relationship between a given stimulus and one or more comparison stimuli, and if there are no comparison stimuli the concept of distinctiveness is simply not applicable. With just two stimuli D must always be 50% for each stimulus; with more than two stimuli, 50% is the limit that D approaches for an end stimulus as it gets further and further away from all the other stimuli on the continuum and they, in turn, get closer and closer together.

⁴ Throughout the paper the reader should bear in mind that D% predicts relative but not absolute accuracy. That is, given the total number of correct responses to a group of stimuli, D% can predict how many correct responses should occur to each of the individual stimuli. However, D% cannot predict, either for one stimulus or for a group of stimuli, the total number of correct responses that will occur given, say, a fixed number of presentations of the stimulus or stimuli.

To say that the distinctiveness of each stimulus with an n of 2 is always 50%, or to say that 50% is the limit the distinctiveness of an end stimulus approaches under certain conditions when n is greater than two, is to make statements about scale values, or D%. What would actually happen if these conclusions were tested by the method of absolute judgment? It is possible to answer this question by making a simple probability analysis; that is, by determining the expected probability under these conditions and then converting the probabilities into percentage correct. With two stimuli the percentage of correct identifications of each should always be 50%. This would hold true irrespective of the similarity between the stimuli. If they were very dissimilar, each of the two should always be correctly identified, and if they were very similar (in the extreme case, identical), each would by chance be correctly identified half the time. But irrespective of the absolute level of accuracy, 50% of the correct identifications that did occur should always occur to each of the two stimuli.
With more than two stimuli, the
extreme case occurs when one end
stimulus is always correctly identified
and the other stimuli are never
confused with it but always confused
with each other. With n stimuli each
presented k times (where k is the
same for all stimuli) the end stimulus
will be correctly identified k times and
each of the other stimuli will be
correctly identified by chance k/(n − 1)
times. For the end stimulus the
percentage of correct identifications is
given by the expression

$$\frac{k}{k + (n - 1)\,\dfrac{k}{n - 1}} \times 100\% \qquad [1]$$

This expression is always (k/2k)
× 100%, or 50%. The value of 50%
is also the limit that D% approaches 
under these conditions. Thus, an 
analysis based on scale values and an 
independent probability analysis lead 
to the same conclusions; the agree- 
ment seems to provide further evi- 
dence of the validity of the D scale. 
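The probability argument behind Equation [1] can be checked with exact rational arithmetic. In the Python sketch below the function name `end_stimulus_percentage` is our own; the counts simply follow the extreme case described above (the end stimulus is always identified correctly, and each of the other n − 1 stimuli is identified correctly k/(n − 1) times by chance).

```python
from fractions import Fraction


def end_stimulus_percentage(n: int, k: int) -> Fraction:
    """Share (in %) of all correct identifications that go to the end stimulus.

    Extreme case from the text: the end stimulus, presented k times, is always
    identified correctly; the other n - 1 stimuli, never confused with it but
    always with each other, are each identified correctly k/(n - 1) times.
    """
    if n < 3 or k < 1:
        raise ValueError("need more than two stimuli and at least one presentation")
    end_correct = Fraction(k)
    others_correct = (n - 1) * Fraction(k, n - 1)  # summed chance hits, always k
    return end_correct / (end_correct + others_correct) * 100


for n, k in [(3, 10), (5, 7), (12, 40)]:
    print(n, k, end_stimulus_percentage(n, k))  # 50 in every case
```

Because the chance hits on the other stimuli always sum to k, the denominator is 2k whatever n and k are, which is why the result does not depend on either.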
The above analysis also suggests a 
distinction between stimulus similarity 
and stimulus distinctiveness. Simi- 
larity refers to two stimuli, distinctive- 
ness to more than two stimuli. The 
similarity between two stimuli (or 
between a single standard stimulus 
and each of a number of different 
comparison stimuli taken one at a 
time) is a function of the difference 
between the two stimuli; the smaller 
the difference the greater the simi- 
larity. Distinctiveness on the other 
hand is the extent to which a given 
stimulus stands out from other stimuli, 
and requires a minimum of three 
stimuli to be applicable. Considered 
in this way, similarity and distinctiveness
are independent concepts: any
two stimuli can vary widely in
similarity even though D%, were it
appropriate, would always be 50% for each,
and with more than two stimuli the
concept of distinctiveness applies but
similarity does not.

Even though it does have a zero 
point, the D scale appears to be a type 
of scale characterized by Stevens 
(1957) as a logarithmic interval scale. 
That is, D% does not change if all 
stimulus values are multiplied by 
some constant greater than zero; also, 
D% does not change if all stimulus 
values are raised to some power 
greater than zero. The latter invari- 
ance may turn out to be particularly 
significant. Under ideal conditions
(admittedly seldom encountered in 
practice) the energies of such stimuli 
as sounds and lights fall off as the 
square of the distance. Thus, the 
ratios of the energy values of a group 
of lights observed from a short dis- 
tance will be quite different from 
these same ratios at a greater distance, 
yet D% will be the same because 
distinctiveness is a logarithmic in- 
terval scale. Insofar as contrast 
effects are related to distinctiveness 
(as, for instance, in a black and white 
photograph) one would then expect 
the contrast effects to be independent 
of distance even though the ratios of 
the stimulus energies changed mark- 
edly. Also, it should be mentioned 
that this invariance was really a 
necessary condition for the group test- 
ing of loudness; with Ss at different 
distances from the sound source the 
absolute sound level as measured in 
db’s differed from S to S. However, 
D% was unaffected by this and so the 
same for all Ss. 

On the other hand D% will change 
if a constant is added to or subtracted 
from one or more of the stimulus 
values. This would mean, for instance,
that if one or more stimuli
were added to (or taken away from)
the initial group of stimuli, the D%
values for the original (or remaining)
stimuli would be changed. This,
however, is reasonable; the distinc- 
tiveness of a given stimulus is always 
relative to other stimuli, and if the 
other stimuli are changed the dis- 
tinctiveness of the original stimulus 
should also change. 
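The invariances described in the last two paragraphs can be checked numerically. The Python sketch below assumes a definition of D% that this section does not restate: each stimulus's total distinctiveness (TD) is taken to be the sum of its absolute differences in log magnitude from every other stimulus, and D% is that TD as a percentage of the TD summed over all stimuli. The stimulus values and the helpers `d_percent` and `same` are hypothetical.

```python
import math
from typing import List, Sequence


def d_percent(values: Sequence[float]) -> List[float]:
    """D% for each stimulus under this sketch's assumed definition."""
    logs = [math.log10(v) for v in values]
    td = [sum(abs(a - b) for b in logs) for a in logs]  # assumed TD
    total = sum(td)
    return [100 * t / total for t in td]


def same(xs: Sequence[float], ys: Sequence[float], tol: float = 1e-9) -> bool:
    """True if two equal-length sequences agree element by element."""
    return len(xs) == len(ys) and all(abs(x - y) < tol for x, y in zip(xs, ys))


stimuli = [1.0, 2.0, 5.0, 9.0, 20.0]  # arbitrary positive magnitudes
base = d_percent(stimuli)

scaled = d_percent([3.7 * s for s in stimuli])   # every value multiplied by a constant > 0
powered = d_percent([s ** 2 for s in stimuli])   # every value raised to a power > 0
shifted = d_percent([s + 4.0 for s in stimuli])  # a constant added to every value
dropped = d_percent(stimuli[:-1])                # one stimulus taken away

print(same(base, scaled), same(base, powered))   # True True
print(same(base, shifted))                       # False
print(same(base[:-1], dropped))                  # False: remaining D% values change
```

Multiplying every value adds the same constant to every logarithm, and raising every value to a power multiplies every logarithm by the same factor, so under this assumed definition the D% proportions cannot change; adding a constant, or changing the set of stimuli, alters the log differences themselves.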

One advantage of the D scale is
that, because it is formulated in
quantitative terms, it permits specific
predictions to be made and tested. For
instance, the stimulus spacing used in 
the third of the three loudness experi- 
ments resulted from an attempt to set 
up an equal-interval D scale. That 
is, the difference in distinctiveness 
between the first and second stimulus 
was the same as the difference between 
the second and third stimulus, this in 
turn was the same as the difference 
between the third and fourth stimulus, 
and so on. To space the stimuli
properly it was necessary to work
backward. That is, we started with
equations which gave D% for each
stimulus, found successive differences
and set them equal to each other, put
them in the form of simultaneous
equations and, given the numerical
values of the middle and end stimuli
(i.e., 0, 20, and 40 db), solved for the
stimulus values. The stimulus spacing
for an equal-interval D scale covering
a 40 db range is shown in Table 3
under the heading of “predicted”
values for Experiment III; the fact
that the standard error of estimate for
this experiment was only 1.5% suggests
that the predicted equal-interval
D scale was fairly well confirmed by
the data.⁵


5 The stimuli in the first loudness
experiment were spaced so that the predicted values
(i.e., D%) would be symmetrical. The
stimuli in the second loudness experiment 
were spaced so that the values predicted on 
the basis of a power function (as opposed to 
a log function) would be symmetrical. How- 
ever, the predicted values shown in Table 3 
for the third experiment are those predicted 
on the basis of the log function. 


On the basis of this same general 
procedure several other conclusions 
about the nature of the D scale may 
be drawn. For instance, it can be 
proved that irrespective of the spacing 
of the stimuli the middle stimulus 
(when n is odd) or stimuli (when n is 
even) must always be the least dis- 
tinctive. Also, irrespective of the 
spacing of the stimuli there must be a 
monotonic decrease in the distinc- 
tiveness of the stimuli in going from 
either end toward the middle. Therefore,
assuming no two stimuli are
identical, it is literally impossible to
achieve an “equal-distinctiveness”
scale, that is, a group of stimuli (larger
than two) in which all stimuli have the
same D% value. In this connection
Garner and Hake (1951) discuss a
“scale of equal discriminability,” which
is a scale derived by adjusting the
response categories so as to equalize the
extent to which any two stimuli
receive the same response. As the
results of Garner (1953) show, an
equal-discriminability scale does not yield
an equal-distinctiveness scale:
response uncertainty increases going
from each end stimulus toward the
middle (see Garner, 1953, Fig. 3, p.
376). Finally, it should be emphasized
that the preceding statements
about distinctiveness refer to scale 
values. While we know of no evi- 
dence that contradicts these state- 
ments, more experimental evidence is 
needed before we can be sure that they 
are substantially correct. 
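The two claims about spacing (the middle stimulus or stimuli least distinctive, and a monotonic decrease from each end toward the middle) can be spot-checked under the same assumed definition of D% used in the other sketches: each stimulus's summed absolute log difference from all the others, normalized to 100%. This section does not restate that definition, so treat it as our assumption; the random spacings and the helpers `d_percent` and `bowed` are hypothetical, and a passing run is only a spot check, not the proof the text refers to.

```python
import math
import random


def d_percent(values):
    """D% under the assumed definition: summed absolute log differences,
    normalised so that all stimuli together total 100%."""
    logs = [math.log10(v) for v in values]
    td = [sum(abs(a - b) for b in logs) for a in logs]
    total = sum(td)
    return [100 * t / total for t in td]


def bowed(d):
    """True if D% falls strictly from each end toward the middle.

    `d` must be ordered by stimulus magnitude. For even n the two middle
    values are not compared with each other (they tie in this sketch).
    """
    n = len(d)
    left = d[: (n + 1) // 2]      # first end up to the middle
    right = d[n // 2 :][::-1]     # last end back to the middle
    return all(x > y for x, y in zip(left, left[1:])) and all(
        x > y for x, y in zip(right, right[1:])
    )


random.seed(1)
for _ in range(1000):
    n = random.randint(3, 15)
    stimuli = sorted(random.sample(range(1, 10_000), n))  # distinct, arbitrary spacing
    assert bowed(d_percent(stimuli)), stimuli
print("bowed for every random spacing tried")
```

Under this assumed definition the result also follows algebraically: with log values sorted as x₁ < … < xₙ, moving from stimulus i to stimulus i + 1 changes TD by (x₍ᵢ₊₁₎ − xᵢ)(2i − n), which is negative below the middle, positive above it, and zero between the two middle stimuli when n is even.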


SERIAL LEARNING 


Having discussed the D scale and 
its validity we would now like to show 
that it can profitably be applied to 
certain serial learning phenomena, in 
particular the bowed serial-position 
curve of verbal learning. Specifically, 
it will be suggested that the shape of 
the serial-position curve results from 
the unequal distinctiveness of the
items that comprise the list, and
therefore the serial-position curve can be
predicted from the D scale. First,
however, it is necessary to present
some evidence to support the conten- 
tion that the bowed serial position 
curve is related to distinctiveness. 

Recently McCrary and Hunter 
(1953) have shown that the shape of 
the serial position curve, when plotted 
in the appropriate manner, is not a
function of such variables as distribu- 
tion of practice, rate of presentation, 
familiarity of the items, or individual 
differences in speed of learning. Also, 
Braun and Heymann (1958) have 
shown that the shape of the serial 
position curve is not a function of 
meaningfulness of material, interitem 
interval, or intertrial interval. Thus, 
the shape of the serial position curve is 
essentially constant despite variations 
in the distribution of practice, rate of 
presentation, familiarity of the items, 
individual differences, meaningfulness 
of material, interitem interval, and 
intertrial interval. Yet we know that 
every one of these variables has a con- 
sistent effect on learning as measured, 
say, by number of trials to criterion. 
Since these variables influence learn- 
ing but do not affect the shape of the 
serial position curve, it would almost 
seem that the shape of the serial 
position curve has nothing whatsoever 
to do with learning. 

If the bowed serial position curve is 
not a manifestation of the learning 
process, then what does it represent? 
As has already been mentioned, we 
would suggest that the curve results 
from the unequal distinctiveness of 
the items in the list. The initial items 
are most distinctive and the middle 
items least distinctive, so the dis- 
tinctiveness is a function of the ordinal 
position in the series. If this seems 
unreasonable, consider an experiment
reported by McCrary and Hunter
(1953). The Ss, all graduate stu- 
dents, learned two serial lists; one list 
was composed of the family names of 
fellow graduate students who were all 
well acquainted with each other and 
the other list was composed of low
association-value nonsense syllables. 
Although the mean number of trials to 
criterion for the two lists differed 
considerably (11 trials and 39 trials, 
respectively), the shapes of the two 
serial position curves were essentially 
identical. Learning the serial list of 
names involved essentially ordering 
familiar items, and the errors of 
anticipation that occurred could easily 
be related to differences in distinctive- 
ness due to ordinal position within the 
series. The fact that the same curve 
resulted with the nonsense syllables 
suggests that the same process oper- 
ates with unfamiliar material. Fi- 
nally, the fact that learning the list 
with the nonsense syllables required 
many more trials points up the neces- 
sity for distinguishing between learn- 
ing the items and learning the order 
of the items (Hovland & Kurtz, 1952). 

The above argument requires, of 
course, that the serial position curve 
be plotted “in the appropriate man- 
ner.” If the curve is plotted in terms 
of errors, then the serial position curve 
should show what percentage of the 
total errors occurred at each serial 
position. This method of analysis 
has been suggested by McCrary and 
Hunter (1953) and seems eminently 
reasonable; if the purpose of the curve
is to show the relative distribution of 
errors, then the errors at each position 
should be presented relative to the 
total number of errors that occurred. 
It could be argued that this method of 
analysis makes it impossible for the 
serial position curve to reflect the 
various experimental variables listed 
above, and therefore the conclusion 
that the shape of the serial position
curve has nothing to do with learning 
results merely from the method of 
analyzing the data. This is not true; 
when plotted in percentage form the 
shape of the curve could still show 
variations in response to these ex- 
perimental variables, and it is the fact 
that it does not which is significant.

If the shape of the serial position
curve reflects the distinctiveness of 
the items within the list, then the D 
scale should be applicable. One must 
assume that in a serial list the
distinctiveness of any given item depends
upon its position in the list relative to 
the position of all the other items in 
the list; also, that the ordinal con- 
tinuum of the items is logarithmic. 
With these assumptions D% can be 
determined in the same way as before. 
To determine the D scale values for a
serial list of any length all that is 
necessary is to transform serial posi- 
tion into log serial position and then 
calculate D%. The appropriate
values for serial lists 8 to 15 items
long are shown in Table 5; these,
then, are the serial position curves
predicted on the basis of the D scale.
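The transformation just described can be sketched as follows. The sketch assumes that each item's total distinctiveness is the sum of its absolute differences in log serial position from every other item, with D% expressing it as a percentage of the total; the function name and the optional `log_decimals` argument are ours. Rounding the logarithms to two decimals is only a guess about how the original hand computations were done, offered because it reproduces the published 8-item column.

```python
import math
from typing import List, Optional


def serial_position_curve(list_length: int,
                          log_decimals: Optional[int] = None) -> List[float]:
    """Predicted percentage correct at serial positions 1..list_length.

    Assumed definition: an item's TD is the sum of its absolute differences in
    log serial position from every other item; D% is TD as a percentage of the
    summed TD. `log_decimals` optionally rounds the logarithms first.
    """
    if list_length < 2:
        raise ValueError("distinctiveness needs at least two items")
    logs = [math.log10(p) for p in range(1, list_length + 1)]
    if log_decimals is not None:
        logs = [round(x, log_decimals) for x in logs]
    td = [sum(abs(a - b) for b in logs) for a in logs]
    total = sum(td)
    return [100 * t / total for t in td]


print([round(x, 1) for x in serial_position_curve(8)])
print([round(x, 1) for x in serial_position_curve(8, log_decimals=2)])
```

With two-decimal logarithms the second line gives 22.9, 14.0, 10.4, 9.2, 9.2, 10.0, 11.4, and 12.9%, the 8-item values used in Table 6. With full-precision logarithms the values differ from these by at most about 0.15 percentage point.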

The importance of Table 5 is that 
it provides a set of predicted results 
with which any obtained data can be 
compared, and these predictions are 
independent of such variables as 
presentation rate, distribution of prac- 
tice, familiarity of material, or learn- 
ing ability of the Ss. To make any 
such comparisons it is only necessary 
to determine the number of correct 
responses at each position and then 
express this as a percentage of the 
total number of correct responses. 
Preferably the percentage correct at 
each position should be calculated 
separately for each individual S and 
then averaged for all Ss. With a 
number of Ss it then becomes a simple 
matter to determine whether or not 
any differences between predicted and
obtained results are within the limits
of experimental error.
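The recommended scoring can be written as a short function. The Python sketch below converts each S's correct responses to percentages of that S's own total before averaging across Ss, as the text advises; the function name and the counts are hypothetical.

```python
def percentage_correct_curve(correct_by_subject):
    """Mean, over Ss, of each S's percentage-correct curve.

    correct_by_subject: one list per S giving the number of correct responses
    at each serial position (the same list length for every S).
    """
    curves = []
    for counts in correct_by_subject:
        total = sum(counts)
        if total == 0:
            raise ValueError("an S with no correct responses has no curve")
        curves.append([100 * c / total for c in counts])  # this S's percentages
    n_subjects = len(curves)
    return [sum(column) / n_subjects for column in zip(*curves)]


# Hypothetical counts for three Ss on a 4-item list (illustration only).
data = [
    [9, 5, 4, 6],
    [12, 7, 6, 9],
    [7, 4, 3, 4],
]
print([round(x, 1) for x in percentage_correct_curve(data)])
```

Averaging percentages rather than raw counts keeps a fast learner with few total correct responses from being outweighed by a slow learner with many.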

TABLE 5

PREDICTED SERIAL POSITION CURVES FOR LISTS OF 8-15 ITEMS

Serial    Percentage Correct, by Number of Items in List
Position        8      9     10     11     12     13     14     15
1           22.9%  21.3%  19.9%  18.7%  17.6%  16.7%  15.9%  15.2%
2           14.0%  13.3%  12.6%  12.1%  11.6%  11.1%  10.7%  10.3%
3           10.4%   9.8%   9.3%   9.0%   8.6%   8.3%   8.1%   7.8%
4            9.2%   8.4%   7.9%   7.5%   7.2%   6.9%   6.7%   6.5%
5            9.2%   8.1%   7.3%   6.7%   6.3%   6.0%   5.8%   5.6%
6           10.0%   8.4%   7.3%   6.5%   6.0%   5.6%   5.3%   5.1%
7           11.4%   9.2%   7.7%   6.7%   6.0%   5.5%   5.1%   4.8%
8           12.9%  10.1%   8.3%   7.1%   6.2%   5.6%   5.1%   4.8%
9                  11.5%   9.2%   7.7%   6.6%   5.8%   5.3%   4.8%
10                        10.4%   8.6%   7.2%   6.3%   5.6%   5.0%
11                                9.4%   7.9%   6.8%   5.9%   5.3%
12                                       8.7%   7.4%   6.4%   5.6%
13                                              7.9%   6.8%   5.9%
14                                                     7.5%   6.5%
15

There is, of course, enough knowledge
presently available about the
shape of the serial position curve to
state that the predicted curves have
approximately the correct shape (i.e.,
bowed and asymmetrical). As a
more rigorous test of the theory it is 
necessary to compare predicted and 
obtained values as was done with 
studies of absolute judgment. Un- 
fortunately, however, most of the 
published studies report data in terms 
of errors, and errors cannot be trans- 
lated into percentage correct without 
the data of individual Ss. There is, 
however, one recent experiment where 
the data were presented in sufficient 
detail to make the necessary analysis. 
Bugelski (1950) used an eight-item 
serial list, and the obtained results 
were presented in his Table 2 (p. 340). 


TABLE 6

PREDICTED AND OBTAINED RESULTS FROM
BUGELSKI (1950)

Serial Position   Predicted   Obtained
1                     22.9%      23.8%
2                     14.0%      18.1%
3                     10.4%      12.3%
4                      9.2%       9.2%
5                      9.2%       7.9%
6                     10.0%       6.7%
7                     11.4%       8.9%
8                     12.9%      13.1%


These results and the results predicted 
by the D scale are shown in Table 6; 
the standard error of estimate is 2.2%. 
Thus, in at least this one case the 
predictions are fairly accurate. Pa- 
renthetically it should be mentioned 
that Bugelski counterbalanced the
items among serial positions, a necessary
precaution to ensure that serial
position is not confounded with the 
items (or sequence of items) within 
the list. 
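This section never defines the standard error of estimate. Taking it to be the root-mean-square difference between predicted and obtained percentages, with the number of serial positions as the divisor, reproduces the 2.2% reported for Bugelski's data, so the sketch below uses that reading as an assumption; the function name is ours.

```python
import math

# Table 6: predicted (D scale) and obtained (Bugelski, 1950) percentages correct.
predicted = [22.9, 14.0, 10.4, 9.2, 9.2, 10.0, 11.4, 12.9]
obtained = [23.8, 18.1, 12.3, 9.2, 7.9, 6.7, 8.9, 13.1]


def standard_error_of_estimate(pred, obs):
    """Root-mean-square difference between predicted and obtained values.

    Assumption: the divisor is the number of points (not n - 1), the reading
    that reproduces the 2.2% quoted in the text.
    """
    if len(pred) != len(obs) or not pred:
        raise ValueError("need two equal-length, non-empty sequences")
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(pred))


print(round(standard_error_of_estimate(predicted, obtained), 1))  # 2.2
```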

Ideally the applicability of the D 
scale to the serial position curve would 
be determined by comparing its pre- 
dictive power to that of other theories. 
Unfortunately, however, no other
theories have been worked out in
sufficient detail to permit quantitative
predictions to be made.⁶ There is
little doubt that the D scale can,
with a fair degree of accuracy, predict
the shape of the serial position curve;
it remains to be seen whether different 
theories can make predictions which 
are consistently more accurate. 

The chief purpose of Bugelski’s 
experiment was to study remote 
forward associations. Since remote 
associations and the bowed serial 
position curve are often considered to 
be interrelated, it seemed appropriate 
to apply the D scale to Bugelski’s 
data on remote forward associations. 
We determined the average distinc- 
tiveness for each degree of remoteness 
and compared these theoretical values 
with the data on remote forward 
associations presented by Bugelski in 
Table 4 (p. 341). The correlation 
coefficient was −.81, indicating that
as distinctiveness increased the mean 
number of remote forward associa- 
tions decreased. 

Hull (1948) has suggested that the 
principles of rote serial learning on the 
human level and maze learning on the 
animal level may be very similar. If 
the animal is reinforced after each 
correct choice (as opposed to the more 
conventional procedure of reinforcing 
the animal only at the end of the 
maze) the maze learning situation 
becomes almost identical to that of 
verbal serial learning. In a maze- 
learning study using this procedure 
Hull (1948) obtained a serial position 
curve very similar to the bowed serial 
position curve of verbal learning. As
a further test of the D scale the
predicted values were determined in the
usual manner, and the obtained results
were taken from Hull’s Table 3 (p.
20). The predicted percentages of
correct responses at the four choice
points were 35%, 20%, 20%, and
26%; the obtained percentages of
correct responses were 30%, 23%, 21%,
and 26%, respectively. If it is
meaningful to compute a standard deviation
on four values, the standard error
of estimate is 3%. Thus, under the
appropriate conditions the D scale can
perhaps even be related to choice-point
behavior in a maze.

6 The Hullian theory (Hull et al., 1940) is
an exception to the above statement. However,
it is seldom used, and we have never seen
a priori predictions about either the number
or percentage of errors that should occur
at each serial position. Perhaps the present
approach will stir proponents of the Hullian
theory into making greater use of it.
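The maze figures can be recomputed the same way. Below, the four choice points are treated as ordinal positions 1 to 4 on a log scale, D% is assumed to be the normalized sum of absolute log differences, and the standard error of estimate is read as a root-mean-square difference with a divisor of 4. These readings are ours, although they do recover the 35, 20, 20, 26 and the 3% reported above; the function name is hypothetical.

```python
import math


def d_percent_for_positions(n_positions):
    """Predicted % correct at each of n ordinal positions, from log position.

    Sketch assumption: TD = sum of absolute log-position differences from all
    other positions; D% = TD as a percentage of the summed TD.
    """
    logs = [math.log10(p) for p in range(1, n_positions + 1)]
    td = [sum(abs(a - b) for b in logs) for a in logs]
    total = sum(td)
    return [100 * t / total for t in td]


predicted = [round(x) for x in d_percent_for_positions(4)]
obtained = [30, 23, 21, 26]  # Hull (1948), Table 3, as quoted in the text
se = math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, obtained)) / len(predicted))
print(predicted, round(se))  # [35, 20, 20, 26] 3
```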

To conclude this section on serial 
learning one basic problem should be 
mentioned. The D scale is predicated 
on the assumption that distinctiveness 
does not change. Thus, in using the
method of absolute judgment, the
accuracy of identification in the early
trials should be the same as in the late
trials. An analysis of our data showed
this to be the case; there were no
consistent changes in accuracy during the
course of practice in any of the three 
experiments. The same is not true of 
serial learning, as has been shown by 
Ward (1937) for verbal learning and 
Spence and Shipley (1934) for maze 
learning. This means that the D 
scale can only predict within a certain
range of trials; if too few trials or too 
many trials are given the theory 
probably should not be applied. Un- 
fortunately it is impossible to state 
now what the allowable range of trials 
is; this will have to be determined by 
further research. 

However, to demonstrate that this 
restriction is not an insurmountable 
objection we would like to present the 
results of two rote-learning experi- 
ments where the D scale can be ap- 
plied even though systematic changes 
in performance were taking place. 


TABLE 7 


PREDICTED AND OBTAINED RESULTS FROM 
LEARNING EXPERIMENT WITH WEIGHTS 


Weight (lbs.) Predicted Obtained 


21.3% 

14.8% 

10.4% 

9.2% 

9.2% 

10 9.8% 
13 11.7% 
16 13.6% 


17.6% ± 1.5%
13.4% ± 1.3%
11.0% ± 1.9%


Both experiments⁷ used paired-associate
learning and not serial learning.
In the first experiment there were
eight weights in identical containers
ranging from 1 to 16 lbs., and the Ss
had to learn the correct color name 
for each. Weights and names were 
randomly paired. The standard an- 
ticipation method of paired-associates 
was used with a 15-sec. presentation 
rate (5 sec. for S, 10 sec. for S-R) and 
a two-minute intertrial interval. The 
criterion was one perfect trial. There 
were 20 Ss, and they were tested 
individually. 

The results are shown in Table 7 
and, as can be seen, the predictions 
were fairly accurate. The standard 
error of estimate was 2.0%, and the 
reliability coefficient was .97. To 
determine if learning took place, a 
Vincent-curve technique was used; 
for all Ss, the mean number of correct 
responses for the first, middle, and 
last third of practice were 16, 24, and 
30, respectively. The difference between
the first and middle third and
the difference between the middle
and last third were both highly significant
(p < .001) even when tested by the
sign test. The learning data were so
consistent that there was not a
single case of reversal (i.e., more
correct responses in an earlier stage of
practice). Thus, there seems to be
little doubt that learning did occur,
yet the theory still predicted the
distribution of correct responses among
the eight stimuli.

7 These two experiments were designed and
conducted by undergraduate psychology
majors in a course in experimental psychology
taught by the author. Both experiments had
been completed prior to the development of
the D scale, so they were in no sense originally
designed to test it. Lawrence Albert and
David Kantor conducted the first experiment
and Alfred Wilder conducted the second
experiment; the analysis and interpretation of
the data are the author’s responsibility.
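The sign-test result can be checked directly. The sketch assumes that no S had tied counts for the two thirds being compared, so that all 20 Ss count as changes in the same direction; the function name is hypothetical.

```python
from math import comb


def sign_test_p(n_plus: int, n_minus: int) -> float:
    """Exact two-tailed sign test p-value (ties assumed already dropped)."""
    n = n_plus + n_minus
    k = min(n_plus, n_minus)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n  # P(X <= k), X ~ Bin(n, 1/2)
    return min(1.0, 2 * tail)


# 20 Ss, every one improving from the first to the middle third (no reversals).
p = sign_test_p(20, 0)
print(p, p < 0.001)  # about 1.9e-06, True
```

With no reversals among 20 Ss the two-tailed probability is 2 × (1/2)²⁰, far below the .001 level quoted.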

The second experiment was again 
a standard paired-associate learning 
task; here the stimuli were time in- 
tervals of different duration and the 
responses were letters of the alphabet 
that were randomly assigned to the 
stimuli. There were four conditions, 
and they differed in the range of time 
intervals used. In each condition ad- 
jacent stimuli always differed by 1 
sec., and the ranges were 1-4 sec., 
2-5 sec., 3-6 sec., and 4-7 sec. There 
were 12 Ss tested individually, and all 
Ss served in all four conditions. Practice
effects were controlled by
counterbalancing order with a 4 × 4 Latin
square. Under each condition all Ss 
learned to a criterion of two consecu- 
tive perfect trials. 

Here we used, for each stimulus, 
total distinctiveness (TD) and not 
percentage of distinctiveness (D%). 
The latter measure was not used be- 
cause it would obliterate the very 
group differences we were studying. 
For all 16 stimuli (four stimuli in each 
of four conditions) the correlation be- 
tween TD and mean number of errors 
was .87. The number of correct re- 
sponses was not used because the four 
groups differed significantly in number 
of trials to reach criterion (means of
2.1, 3.7, 6.7, and 8.3 trials, respectively;
F = 11.7, p < .001). The fact that
the correlation between TD and errors 
was relatively high suggests that, 
again, the D scale is applicable. 
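The reason for using TD rather than D% in this experiment can be illustrated with the four conditions. We assume a logarithmic time continuum and define TD, as in the other sketches, as a stimulus's summed absolute log difference from the other stimuli in its condition; that definition is not restated in this section. Under these assumptions the same 1-sec spacing yields progressively smaller TD as the range moves up, whereas D% is renormalized to 100% within every condition and so hides that difference.

```python
import math

# Four stimuli per condition, adjacent stimuli 1 sec apart (from the text).
conditions = {
    "1-4 sec": [1, 2, 3, 4],
    "2-5 sec": [2, 3, 4, 5],
    "3-6 sec": [3, 4, 5, 6],
    "4-7 sec": [4, 5, 6, 7],
}


def total_distinctiveness(seconds):
    """TD for each stimulus: summed absolute log differences from the others.

    This is the definition assumed throughout these sketches, not a formula
    quoted from the text above.
    """
    logs = [math.log10(s) for s in seconds]
    return [sum(abs(a - b) for b in logs) for a in logs]


summed_td = {}
for name, seconds in conditions.items():
    td = total_distinctiveness(seconds)
    d_pct = [100 * t / sum(td) for t in td]  # always sums to 100 within a condition
    summed_td[name] = sum(td)
    print(name, "TD:", [round(t, 3) for t in td], "D%:", [round(d, 1) for d in d_pct])
```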


THE FUNCTION


One of the assumptions underlying 
the development of the D scale is the 
Weber-Fechner Law, and since its 
inception this law has been widely 
criticized. Recently Stevens (1957) 
has been one of the most vocal critics, 
and in view of his recent work on ratio 
scaling it is necessary to justify the use 
of the log function. 

Historically, the Weber-Fechner 
Law developed from Weber’s Law.
Weber’s Law states that a jnd is a con- 
stant fraction of the stimulus magni- 
tude; by making a few assumptions 
this statement can be put in the form 
of a differential equation and inte- 
grated, and the result is the Weber- 
Fechner Law. It should be made 
perfectly clear that the assumption of 
a logarithmic function does not de- 
pend upon the validity of Weber's 
Law. For instance, one could simply 
start by assuming a log function and 
justify this assumption by showing, if 
possible, that this led to predictions 
which were verified by experimental 
tests. Or, one could justify this as- 
sumption by citing certain physio- 
logical evidence. It seems to be 
generally accepted (e.g., Hebb, 1958, 
pp. 91-98) that the nervous system codes
intensity of stimulation into frequency 
of nervous impulses; the greater the 
intensity the greater the frequency. 
Further, there seems to be consider- 
able evidence that the frequency of 
nervous impulses is proportional to 
the log of the stimulus intensity. As 
Ruch (1955) says, “Whatever its
original derivation, Fechner’s equa- 
tion appears to express a fundamental 
feature of sense organ behavior. Over 
a certain range of intensities, the fre- 
quency of discharge is a linear function 
of the logarithm of the stimulus” 
(p. 308). Therefore, since intensity 
sensations depend upon frequency, 
and since frequency is proportional to 
log stimulus magnitude, it would seem
reasonable to conclude that intensity 
sensations are proportional to log 
stimulus magnitude. 
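The integration step mentioned above can be written out as follows. The text does not state the assumptions, so the usual textbook ones are supplied here and should be read as ours: each jnd is treated as adding a constant increment of sensation, and the symbols ψ (sensation), S (stimulus magnitude), S₀ (absolute threshold), c and k (constants) are our notation.

```latex
\frac{\Delta S}{S} = c \qquad \text{(Weber's Law)}

d\psi = k\,\frac{dS}{S} \qquad \text{(each jnd taken as a constant increment of sensation)}

\psi = k \ln S + C = k \ln \frac{S}{S_0}, \qquad \psi(S_0) = 0 \qquad \text{(Weber-Fechner Law)}
```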

As a first-order approximation (and 
within a certain range) there really 
can be no doubt that the Weber-Fechner
Law is “true.” That is, it is
more nearly correct to say that sensa- 
tion is proportional to the logarithm 
of the stimulus magnitude than to say 
that sensation is proportional to the 
stimulus magnitude. However, while 
the log function is better than a linear 
function, some third function may be 
better yet. Stevens (1957) has pre- 
sented a convincing case for a power 
function, suggesting that it be
considered the “psychophysical law.”
That is, instead of stating that the 
sensation is proportional to the log- 
arithm of the stimulus magnitude 
Stevens suggests that the sensation is 
proportional to the stimulus magnitude
raised to a power n, where n
assumes different values for different 
perceptual continua. 

Fortunately it is very easy to com- 
pare the log function and the power 
function for the D scale. We have 
already presented the predicted values 
based on a log function; the predicted 
values can also be worked out on the 
assumption of a power function and 
the two compared. We have done 
this in all cases where a standard error 
of estimate was obtained for the log 
function, using as exponents the 
values suggested by Stevens (0.3 for 
loudness, 1.0 for visual area, and 1.45 
for heaviness, all from Stevens, 1957, 
Table 1, p. 166). The comparison 
between the log function and the
power function was restricted to these 
experiments because it is felt the 
standard error of estimate provides a 
more rigorous test of the theory than 
does the correlation coefficient. 
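Computing power-function predictions only requires changing the transform applied to the stimuli before differences are taken. The sketch below assumes D% is the normalized sum of absolute differences on whichever sensation scale is used. The 0 to 40 db stimuli in 10-db steps are hypothetical and are not the spacing of any experiment reported here. Applying Stevens's 0.3 exponent to relative sound energy is our assumption about the stimulus measure.

```python
import math


def d_percent(values, transform):
    """D% for each stimulus after mapping magnitudes through `transform`.

    Sketch assumption: TD = sum of absolute differences on the transformed
    (sensation) scale; D% = TD as a percentage of the summed TD.
    """
    s = [transform(v) for v in values]
    td = [sum(abs(a - b) for b in s) for a in s]
    total = sum(td)
    return [100 * t / total for t in td]


def log_fn(x):
    return math.log10(x)


def power_fn(exponent):
    return lambda x: x ** exponent


# Hypothetical loudness stimuli, equally spaced in db, converted to relative energy.
db_levels = [0, 10, 20, 30, 40]
energy = [10 ** (db / 10) for db in db_levels]

log_pred = d_percent(energy, log_fn)
pow_pred = d_percent(energy, power_fn(0.3))  # Stevens's exponent for loudness
print("log  ", [round(d, 1) for d in log_pred])  # [25.0, 17.5, 15.0, 17.5, 25.0]
print("power", [round(d, 1) for d in pow_pred])
```

Stimuli equally spaced in db are equally spaced on the log scale, so the log function predicts a symmetrical curve, while the power function predicts the loudest stimulus to be far more distinctive than the softest.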

Before these results are presented 
one additional point must be made. 


TABLE 8 


STANDARD ERRORS OF ESTIMATE FOR A 
RECTANGULAR DISTRIBUTION, POWER 
FUNCTION, AND LOG FUNCTION


Power 
| Distribution) Function 


Loudness—I 3.1% 
Loudness—II 3.7% 
Loudness—III 1 ‘3% 
Eriksen & Hake | 

(1957) 2.5% 
Weights | 3.3 


Mean 2.9% 


Experiment 


We have already shown that, in the 
various experiments, the standard 
errors of estimate based on the log 
function vary from 1% to 3%. Un- 
fortunately, these results are not as 
impressive as they might seem. Pre- 
dictions could be made without any 
theory at all, and even these predic- 
tions might not be too inaccurate. 
For instance, one could simply assume 
a rectangular distribution; that is, 
assume all stimuli to be equally dis- 
tinctive so each should be correctly 
identified equally often. The stand- 
ard error of estimate based on the 
hypothesis of a rectangular distribu- 
tion can serve as a convenient refer- 
ence point; this represents what may 
be considered a chance level, analogous 
to predicting the mean criterion score 
for all Ss regardless of test score, and 
any theory must justify itself by mak- 
ing more accurate predictions. 

The standard errors of estimate 
for a rectangular distribution, power 
function, and log function are shown 
in Table 8. In all cases the log func- 
tion had a smaller standard error of 
estimate than the rectangular dis- 
tribution, and in all but one case it 
was also smaller than the power 
function. On the average the power 
function reduced the standard error of 
estimate of the rectangular distribu- 
tion by 24%, but the log function re- 
duced it by 45%. In considering the 
results as a whole it would appear that
predictions from a log function are
more accurate than predictions from a 
power function. 
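The rectangular baseline and the percentage reduction can be computed as below. The example reuses the Bugelski values from Table 6, which are not among the experiments in Table 8, so it illustrates the arithmetic only. Two readings are our assumptions: "reduced it by 45%" is taken as a relative reduction of the rectangular standard error, and the standard error itself as a root-mean-square difference.

```python
import math

predicted_log = [22.9, 14.0, 10.4, 9.2, 9.2, 10.0, 11.4, 12.9]  # Table 6, D scale
obtained = [23.8, 18.1, 12.3, 9.2, 7.9, 6.7, 8.9, 13.1]        # Table 6, Bugelski


def rms_error(pred, obs):
    """Standard error of estimate, read here as the RMS difference (divisor n)."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


rectangular = [100 / len(obtained)] * len(obtained)  # every stimulus equally distinctive
se_rect = rms_error(rectangular, obtained)
se_log = rms_error(predicted_log, obtained)
reduction = 100 * (se_rect - se_log) / se_rect       # relative reduction, in %

print(f"rectangular {se_rect:.1f}%, log {se_log:.1f}%, reduction {reduction:.0f}%")
```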

Not only were the differences between
predicted and obtained values greater
for the power function, but they
were also more systematic. That is, by
and large the differences with the log 
function appeared to be reasonably 
random, the type one would expect 
with chance fluctuation. However, 
the differences with the power func- 
tion often fell into a more consistent 
pattern. For instance, in the Eriksen 
and Hake experiment the power func- 
tion gave predicted values that were 
consistently too low for the smaller 
visual areas but consistently too high 
for the larger visual areas. Discrep- 
ancies of this type cannot very easily 
be attributed to chance fluctuation. 

On a purely empirical basis, then, 
the log function is better than a power 
function for the D scale because the 
log function results in more accurate 
predictions. Also, there is another 
advantage to the log function; it can 
readily be applied to serial learning, 
and this application increases the gen- 
erality of the D scale. At present 
there is no way of applying the power 
function to serial learning phenomena 
because the value of the exponent is 
unknown, and there does not seem to 
be any way that it could be de- 
termined. 


CONCLUSION 


In this paper we have tried to show 
how the concept of stimulus distinc- 
tiveness can be quantified. Evidence 
as to the validity of the D scale has 
been presented, and applications to
serial learning phenomena have been
made. Perhaps further applications
will be possible. Rothkopf (1957),
using a similar type of approach, has
shown that the results of a
psychophysical procedure can lead to predictions
about accuracy of identification
with stimuli which do not at first 
glance readily lend themselves to 
quantification. Also, a study by 
Eriksen and Hake (1955) suggests 
that stimuli which vary in two or 
more dimensions can perhaps be 
treated in a similar manner. 

Finally, as we have stressed through- 
out, the D scale and its applications 
are to be considered as an approxima- 
tion and not as either “true” or 
“false.” It is for this reason that so 
much stress has been placed on the 
predictive accuracy and so little stress 
has been placed on the customary 
type of statistical analysis. Thus, in
Table 3 it may be noted that in
several cases the difference between
predicted and obtained results was
significant at the 1% level of confi- 
dence. These significant differences 
are of course unfortunate, but they 
do not “disprove” the theory. A 
scientific theory is not a logical or 
mathematical formulation which is 
rendered worthless by one exception. 
Rather, scientific theories are more 
and more coming to be perceived as 
models or approximations, and in 
general the better theory leads to the 
closer approximation. Given alter- 
native theories, what is important is 
the accuracy of prediction (and, of 
course, the additional work stimulated 
by the theory); whether differences 
between predicted and obtained re- 
sults are statistically significant may 
be of only secondary importance. 
And since the D scale does seem to be 
a fairly accurate approximation, it 
should be of use in studying problems
which pertain to the distinctiveness of 
stimuli. 
MANIPULATION OF ITEM MARGINAL FREQUENCIES
BY MEANS OF MULTIPLE-RESPONSE ITEMS¹


RICHARD H. WILLIS 
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The purpose of this paper is to 
present a general and flexible method 
for manipulating the marginal fre- 
quencies, or popularity values, of 
attitude items.² The power to adjust
item marginals would often be a great 
convenience to an investigator, but no 
proposal for accomplishing this has 
been made previously. One general 
advantage of the proposed method, 
which has been named the Method of 
Controlled Marginals (MCM), is that 
it makes the investigator relatively in- 
dependent of the strength of wording 
of items. Item wording is a primary 
determinant of item marginals, and it 
is usually difficult or impossible to 
select the wording which will produce 
a desired marginal. In extreme cases, 
where either all respondents or no 
respondents endorse the item, the 
item loses all discriminating power. 
A second general advantage of the 
method is that it allows the separation 
of effects due to content and those due 
to popularity. It has not been pos- 
sible previously to avoid inextricably 
confounding the two, as, for example, 


¹ The author gratefully acknowledges the
many helpful comments and suggestions 
which have resulted from discussions with 
several persons, most particularly, J. C. Gil- 
christ, Robert E. Krug, and Paul Lazarsfeld. 
Part of the material in this paper was reported 
at the 1959 APA meeting in Cincinnati. 

² Custom dictates that one speak of the
popularity of an attitude item and of the 
difficulty of an aptitude or achievement item. 
Both terms are used in the present paper, but 
whenever it is convenient to use a generic 
term, popularity will be employed. The 
term marginal will refer either to marginal 
frequency or marginal proportion, depending 
upon the context. 


when inferences are made concerning 
the similarity of content among items 
of a scale from a measure such as 
Guttman’s coefficient of reproduci- 
bility, which is a joint function of both 
similarity of item content and the 
distribution of item popularities (see, 
e.g., Borgatta, 1955; Menzel, 1953; 
Willis, 1954). Before discussing the 
uses and advantages of the MCM in 
greater detail, the general procedure 
will be described. 

The control of item popularity 
values is achieved by obtaining an 
essentially continuous distribution of 
responses to each item by means of a 
graphic rating scale, thus allowing the 
cutting point between endorsement 
and nonendorsement to be located as 
desired. Multiple-response items— 
items which require more than one 
response from each respondent—are 
particularly well suited for this
purpose. For example, the item might
read “Laplanders are good people:
(a) Agree, (b) Disagree,” and the
respondent would circle either (a) or 
(b). There would follow a graphic 
rating scale on which the respondent 
would indicate how strongly he agreed 
or disagreed with the item by placing 
a mark at any point along the scale. 
Except for occasional ties, the re- 
spondents would be ranked on the 
basis of the one item, and the sample 
could be dichotomized at any point. 
Figure 1 shows the item as answered 
by a respondent who agreed quite 
strongly with the statement. 
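The cutting-point idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's procedure: it supposes (hypothetically) that the Agree/Disagree choice and the graphic-scale mark are combined into one signed score, positive for agreement and negative for disagreement, with the 0 to 100 mark as magnitude. The names `signed_score`, `cut_for_marginal`, and `endorsement_rate` are illustrative only.

```python
def signed_score(agrees: bool, mark: float) -> float:
    """Hypothetical encoding: +mark for Agree, -mark for Disagree (mark in 0..100)."""
    return mark if agrees else -mark


def cut_for_marginal(scores, target):
    """Return a cutting point c so that 'endorsing' (score > c) holds for
    round(target * N) respondents, when ties at the cut allow it."""
    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    n_endorse = round(target * n)
    if n_endorse <= 0:
        return ranked[0]            # nobody scores strictly above the maximum
    if n_endorse >= n:
        return ranked[-1] - 1.0     # everybody scores above this point
    # Midway between the last endorser and the first non-endorser;
    # if those two scores are tied, the achieved proportion falls short.
    return (ranked[n_endorse - 1] + ranked[n_endorse]) / 2


def endorsement_rate(scores, cut):
    """Marginal proportion obtained by dichotomizing at `cut`."""
    return sum(s > cut for s in scores) / len(scores)


# Eight hypothetical respondents: (chose Agree?, graphic-scale mark)
answers = [(True, 80), (True, 55), (True, 20), (False, 10),
           (False, 40), (False, 90), (True, 65), (False, 5)]
scores = [signed_score(a, m) for a, m in answers]
for target in (0.25, 0.50, 0.75):
    c = cut_for_marginal(scores, target)
    print(target, c, endorsement_rate(scores, c))
```

In this made-up sample the raw Agree/Disagree split is already .50; moving the cut to 60 or to −25 yields marginals of .25 and .75 from the very same responses, which is the flexibility the method relies on.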

Note that the MCM gives all the
data necessary for the location of
nonarbitrary zero points by the method of
Guttman (1947) and Suchman (1950),
although this is not a necessary part
of the technique. Note also that
many variations on the basic scheme
described above are possible (Willis,
in press).

Fig. 1. A standard type multiple-response item, as answered
by a respondent who strongly agrees. (Item: LAPLANDERS ARE
GOOD PEOPLE; A. AGREE, B. DISAGREE; graphic scale numbered
0 to 100 and marked MILDLY, MODERATELY, STRONGLY.)

In a number of situations, of course, 
there would be no substantial ad- 
vantage in altering the original dis- 
tribution of item marginals. If a 
normal distribution were desired, and 
if the unadjusted marginals were dis- 
tributed approximately normally, it 
would not be worth even the moderate 
extra effort it would take to adjust
marginals. 

Frequently, however, the obtained 
distribution of marginals will be quite 
inappropriate for the purposes at 
hand, and adjustments will be highly 
desirable. If a decision to adjust 
marginals is made, the distribution 
sought will depend upon the purpose 
of the investigation. A particular
investigation usually has one (or 
possibly both) of two general goals— 
prediction or interpretability. If one 
is primarily interested in prediction, 
the focus of attention is on the correla- 
tion between one’s instrument and an 
external criterion. If, on the other 
hand, one is primarily interested in 
giving a clear-cut interpretation to 
one’s measures, one seeks some variety 
of internal consistency among
responses to items. Concepts and
indices such as single-trial reliability,
mean item covariation, factorial 
purity, homogeneity, reproducibility, 
and scalability all refer to this internal 
consistency aspect of an instrument. 
To the extent that no internal con- 
sistency, either unidimensional or 
multidimensional, can be found, it is 
not possible to give a conceptually 
unambiguous interpretation to one’s 
data. 

So far as the distribution of item 
marginals bears on the problems of 
prediction and interpretability, two 
principles may be stated:


1. In order to maximize predictive 
power, the proper combination of 
item content and distribution of 
item marginals is required, the 
optimal distribution depending on a 
number of variables. 

2. In order to maximize interpret- 
ability, complete separation of con- 
tent and popularity or difficulty is 
required. This can be achieved by 
adjusting all item marginals to be 
equal. Alternatively, one can hold
content constant and vary the 
popularity or difficulty, by using 
more than one cutting-point with a 
single item. 


Now we consider these two problems 
in turn. 
IMPROVING PREDICTIVE POWER


A necessary (but not sufficient) 
condition for predictive power in an 
instrument is that it possess the 
capacity to discriminate among re- 
spondents, as reflected by a measure 
such as the variance of the distribution 
of scores about its mean. Ordinarily, 
an investigator has no way of insuring 
against a highly skewed distribution 
of scores. The greater such skew, the 
greater the loss of discriminating 
power. Without some safeguard, 
such as that supplied by the MCM, an 
investigator is always at the mercy of 
the unpredictability of the relation 
between the specific wording of an 
item and the proportion of respond- 
ents in his sample that will endorse 
it. With the power to manipulate 
item marginals, one is almost entirely 
independent of item wording insofar 
as it affects the distribution of scores. 

An experience of the author well 
illustrates the unfortunate result of 
having no control over the score dis- 
tribution. In a study of the relation 
between political and child-rearing 
attitudes (Willis, 1956), conducted 
before the development of the MCM, 
one of the scales attempted to tap 
hostility feelings towards parents. 
Despite a pretest, almost all items on 
this scale were worded so strongly as 
to make them unacceptable to all but 
a small minority of the respondents. 
The distribution of scores was so 
skewed as a result that the discrim- 
inating power was nil, and the scale 
had to be discarded. 

By means of the MCM an investi- 
gator can salvage otherwise lost items 
and, in cases similar to that above, 
whole scales may sometimes be saved, 
providing there is some variance in the 
responses made on the graphic scales 
of the items. Even in the unlikely
event that no appreciable variance
occurs in the graphic scale responses
for many items, there still exists the
possibility of extracting all available 
information from the remaining items 
by using the graphic scale responses to 
obtain a complete score distribution 
from each of the discriminating items. 

A more usual situation, however,
is that of an instrument which dis- 
criminates fairly well among respond- 
ents. It still may be possible to 
increase its predictive power sub- 
stantially by the proper choice of the 
distribution of item marginals. De- 
termining the exact distribution which 
will maximize the validity coefficient 
is a problem which remains unsolved 
in its general form, but the general 
picture can be sketched out. We 
rely heavily on the semiempirical 
results of Brogden (1947) in the 
discussion that follows. The crucial 
variables³ are (a) item reliabilities,
(b) item validities, (c) item
intercorrelations, (d) the number of items,
or length of test, and (e) the nature of 
the predicted criterion—that is, 
whether it is continuous, such as a 
grade point average, or dichotomous, 
such as “pass” or “fail.” For a 
dichotomous criterion, a maximal or 
near-maximal coefficient of validity 
will be obtained if one maximizes the 
total number of item discriminations 
made at the level at which the cri- 
terion is dichotomized, providing that 
item intercorrelations (as measured 
by tetrachoric correlation coefficients) 
are low or moderately low. That is, 
if it is known in advance that about 
15% of all individuals will “pass,” a 
single item will have greatest pre- 
dictive power when its marginal pro- 
portion is equal to .15. A test or 
scale composed of items which do not 
intercorrelate too highly will have 
highest validity when all items have 


³ The additional complication of allowing
differential item weights will not be con- 
sidered. 
marginal proportions of about .15.
With high mean item intercorrela- 
tions, say about .50 or over, it will be 
advisable to include a range of item 
marginals. In fact, as the mean item 
intercorrelation approaches unity, the 
optimal distribution of item marginals 
approaches rectilinearity. If no be- 
forehand information is available as 
to the proportion of the sample 
eventually “passing,” setting all item
marginals at .50 is probably the best 
tentative solution, for this gives the 
absolute maximum number of dis- 
criminations for each item, as well as 
maximizing obtained information in 
the information-theory sense. This 
adjustment of all item marginals to 
.50 will be referred to hereafter as 
response-balancing the items. It will 
probably not be of great advantage in 
most cases to include a range of item 
marginals, for a mean item inter- 
correlation value of .50 or greater is 
rarely attained in practice, except 
possibly when a special effort has been 
made to do so. 
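The two reasons given for response-balancing can be checked directly. An item that splits N respondents into n endorsers and N − n non-endorsers separates n(N − n) pairs of respondents, and its information per response is the binary entropy of its marginal proportion; both peak at .50. A small sketch (function names are illustrative):

```python
import math


def discriminations(n_total: int, p: float) -> int:
    """Pairs of respondents separated by an item endorsed by proportion p."""
    n_endorse = round(p * n_total)
    return n_endorse * (n_total - n_endorse)


def binary_entropy(p: float) -> float:
    """Information in bits per response for an item endorsed with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


for p in (0.15, 0.50, 0.85):
    print(p, discriminations(100, p), round(binary_entropy(p), 3))
```

With 100 respondents an item at .50 separates 2500 pairs, against 1275 at .15 or .85, which is why .50 is the safe default when the proportion to be "passed" is unknown.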

For the prediction of a continuous 
criterion, Brogden’s results may be 
stated briefly as follows: With mean 
item tetrachorics which are low or 
moderately low (under .50, approx- 
imately), it is advantageous to cluster 
items about a difficulty or popularity 
level of .50, although the advantage 
diminishes as the number of items 
increases. Several early, empirical 
studies (Cook, 1932; Richardson, 
1936; Thurstone, 1932) verify this 
finding. With item intercorrelations 
as high as .80, however, response- 
balancing is disadvantageous, as com- 
pared to incorporating a range of 
marginals, and becomes more dis- 
advantageous as the number of items 
increases. Brogden also points out 
that low item validities favor response- 
balancing, and presumably the same 
could be said of low item reliabilities. 


In summary, Brogden’s results in- 
dicate that both low item intercorrela- 
tions and a small number of items (say 
20 or less) favor the use of uniform 
item marginals. Except when item
intercorrelations are unusually high, 
we may say that for the prediction of a 
continuous criterion, items should be 
response-balanced, whereas for the 
prediction of a dichotomous criterion, 
marginals should be grouped in the 
vicinity of the proportion of the 
sample which is to be passed or 
selected, if this is known. 

Where item intercorrelations are 
extremely high, a phenomenon may be 
observed which has been labelled “the
attenuation paradox” by Loevinger
(1954). This “paradox” refers to the
fact that increasing item intercorrela- 
tions beyond a certain point, although 
resulting in a continued increase in 
test reliability, attenuates test
validity. It is Loevinger’s recommendation
that this paradox be resolved by
the use of two rules for test
construction. In her words: “For the ‘classical
region,’ the region in which the
attenuation of validity decreases with
increase in reliability, the closer the
items are to difficulty of .5 and thus
to equivalence, the more reliable and
more valid will the test be. For the
‘region of paradox’ the optimal
distribution of item difficulties must be
determined as a function of item
intercorrelations.” The region of
paradox, of course, refers to very high
item intercorrelations, while the
classical region refers to more moderate
item intercorrelations. In general, 
some dispersion of item marginals 
would usually be desirable in the 
region of paradox, especially if the 
test were to be used with a number of 
groups differing in mean. 

In a more recent article on the 
attenuation paradox, Humphreys 
(1956) outlines the sequence in which
the test technician should make
decisions in constructing a test. He
makes the point that the desired level 
of item intercorrelation should be 
selected at the outset. Only then 
should the form of the raw score 
distribution of test scores be selected 
(a rectangular distribution is recom- 
mended for a general purpose test). 
And finally, the test technician should,
according to Humphreys, “strive to
obtain the desired form of distribution
by varying item difficulties only.”
The MCM makes possible the realization
of this last operation more
accurately and much more easily than
was previously possible.


IMPROVING INTERPRETABILITY 


The following discussion of improv- 
ing interpretability will be confined 
to the improvement of measures of 
unidimensional internal consistency, 
and within this general area, the 
primary emphasis will be on im- 
proving measures of Guttman scal- 
ability. We shall first consider the 
effects of adjusting marginals on some 
common item-to-item indices. Next 
the effects of such adjustments on 
scales consisting of any number of 
items will be examined, and, after 
that, we shall consider the effects of 
such adjustments on measured rela- 
tionships between a single item and 
a scale. Finally, we shall reconsider 
the topic of item-to-item indices in 
the light of the intervening discussion. 


Item-to-Item Internal Consistency 


It is well known that the phi 
coefficient, φ, which is a Pearsonian
correlation computed directly from a
2 × 2 table, is greatly affected by the
marginal proportions. The phi coeffi- 
cient will be greatest, numerically, 
when both items are dichotomized at 
the same point, and the greater the 
difference in the two cutting points,
the greater the drop in φ. To see how
substantial the drop may be in some 
cases, consider the following example. 
A 6 × 6 scatter diagram was
constructed with normally distributed
frequencies along both margins, which
yielded a value of .60 for φ when
reduced to a doubly response-balanced
2 × 2 table. When one marginal
proportion was raised one standard
deviation to .84, while the other was
held at .50, φ dropped to .44. An
upward shift of one marginal by two
standard deviations, to .97, reduced
φ to .18. Shifting one cutting
point up and the other down, both by
one standard deviation, gave φ a value
of .12, while for a shift of two
standard deviations in opposite
directions, φ was reduced to .03. More
generally, as Cronbach (1951) has
shown, mismatching of item marginals
will not change φ appreciably when
the degree of association is low, as
measured by the tetrachoric
correlation coefficient, but for a high degree
of association, such discrepancies in
marginals will lower φ markedly.
Moral: Don’t compute phi coefficients 
without first eliminating any dis- 
crepancy in the proportions of en- 
dorsement, unless a measure is wanted 
which is lowered both by differences 
in item content and differences in item 
popularity or difficulty. 
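Why mismatched marginals depress φ can be made concrete with a standard result that is not derived in the paper: when two items are endorsed by proportions p₁ ≤ p₂, the largest attainable φ is √(p₁(1 − p₂) / (p₂(1 − p₁))), reached when every endorser of the less popular item also endorses the more popular one. A minimal sketch (function names are illustrative):

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table laid out as
                 item2 +  item2 -
        item1 +     a        b
        item1 -     c        d
    """
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom


def phi_max(p1, p2):
    """Largest phi attainable when the items are endorsed by proportions p1 and p2."""
    lo, hi = min(p1, p2), max(p1, p2)
    return math.sqrt(lo * (1 - hi) / (hi * (1 - lo)))


for p in (0.50, 0.84, 0.97):
    print(p, round(phi_max(0.50, p), 3))
```

With one item held at .50, the ceiling is about .44 when the other is cut at .84 and about .18 when it is cut at .97, so once the cutting points diverge that far, no amount of shared content can raise φ above those values.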

Tetrachoric correlations, of course, 
should not be systematically biased by 
such lack of uniformity in marginals— 
for the same product-moment correla- 
tion is being estimated in all cases— 
but very large or very small propor- 
tions, say above .90 or below .10, will 
usually cause some cell frequencies to 
be quite small. This will tend to 
decrease the reliability of the tetra- 
choric estimates, and this unreliability 
will be greater, the smaller the sample. 
For a given sample size, response-balancing
items will tend to increase
the reliability of the tetrachoric
correlations.⁴

Although the question of the Gutt- 
man scalability of a pair of items is 
seldom considered, it is easy to see 
that any discrepancy in the two item 
popularities will operate to increase 
interitem reproducibility. Guttman’s 
measure of scalability is the coefficient 
of reproducibility (Guttman, 1944,
1950) which is an indicator of the 
accuracy with which respondents’ 
exact response patterns can be pre- 
dicted from a knowledge of total 
scores. The condition for perfect 
reproducibility in the case of two 
items is the complete absence of 
respondents who endorse the less 
popular item but do not endorse the 
more popular one. Such patterns of 
responding are referred to as nonscale 
patterns. If both items have exactly 
the same popularity, reproducibility 
will be at a minimum; the knowledge 
that a respondent endorsed exactly 
one item does not allow a better-than- 
chance prediction of which of the two 
items was endorsed. At the other 
extreme, if one item is of very high 
popularity and the other of very low 


⁴ Another basis on which φ coefficients and
tetrachoric correlations have often been
compared is their relative appropriateness for the
factor analysis of items. The use of φ
coefficients has the disadvantage of introducing
so-called “difficulty factors,” due to the
variations in correlations caused by
differences in item difficulty or popularity levels.
Tetrachorics, while avoiding this
complication, may form a non-Gramian correlation
matrix, which would violate the basic
assumptions of factor analysis. The MCM
frequently makes possible the use of either
φ coefficients or tetrachorics computed from
items which have been equated on popularity
or difficulty, and these appear to be the best
alternatives when available. Robert E. 
Krug and the author are currently conducting 
an investigation of the effect of response- 
balancing items on the outcome of factor 
analysis. 
popularity, the chances for nonscale
responses to appear in any quantity 
will be quite low. 
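The two-item argument can be sketched as code. Counting nonscale respondents is direct; the expected count adds an assumption not made in the text, namely that the two items are statistically independent, in which case the nonscale proportion is p_less × (1 − p_more). That baseline is largest (.25) when both items sit at .50 and tiny (.01) at .10 and .90. Function names are illustrative.

```python
def nonscale_count(item_a, item_b):
    """Respondents who endorse the less popular item but not the more popular one.
    item_a, item_b: lists of 0/1 responses, one entry per respondent."""
    if sum(item_a) >= sum(item_b):
        more, less = item_a, item_b
    else:
        more, less = item_b, item_a
    return sum(1 for m, l in zip(more, less) if l == 1 and m == 0)


def expected_nonscale_if_independent(p1, p2):
    """Baseline nonscale proportion, assuming the two items are independent."""
    lo, hi = min(p1, p2), max(p1, p2)
    return lo * (1 - hi)


print(expected_nonscale_if_independent(0.5, 0.5))
print(expected_nonscale_if_independent(0.9, 0.1))
```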


Internal Consistency of Scales 


We now turn to the scalability of 
sets of any number of items. From 
Guttman’s criterion of scalability, one 
may formulate the following definition 
of a perfect scale: All respondents 
endorsing the least popular item also 
endorse all other items, all respondents 
endorsing all but one of the items do 
not endorse the least popular item, 
all respondents endorsing all but two 
of the items do not endorse either of the 
two least popular items, etc., up to those 
respondents who do not endorse even 
the most popular item and therefore 
none of the others. A perfect Gutt- 
man scale of four items may be 
diagramed as in Fig. 2, where the area 
within the rectangle represents the 
entire sample, the area within the
largest circle represents that group of
respondents endorsing the most popular
item, etc. Thus, a perfect
Guttman scale may be thought of
geometrically as a series of perfectly
nested subsets. A case of imperfect
scalability is shown in Fig. 3. The
shaded area represents those respondents
who do not exhibit perfect scale
response patterns.

Fig. 2. Geometrical representation of a
perfect Guttman scale of four items.

Fig. 3. Geometrical representation of an
imperfect Guttman scale of four items.
Shaded areas indicate respondents whose
response patterns do not fit one of the perfect
scale patterns.
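The nested-subsets definition translates into a simple check: order the items from most to least popular, and a pattern is a perfect scale pattern only if it endorses exactly the s most popular items, where s is its total score. The sketch below (illustrative names, 0/1 data assumed) returns the respondents who would fall in the shaded area. With tied popularities, which respondent gets flagged depends on the tie order, but the overall perfect/imperfect verdict does not.

```python
def nonscale_respondents(responses):
    """Indices of respondents whose patterns break perfect nesting.
    responses: one 0/1 pattern per respondent, all over the same k items."""
    k = len(responses[0])
    popularity = [sum(r[j] for r in responses) for j in range(k)]
    order = sorted(range(k), key=lambda j: -popularity[j])   # most popular first
    flagged = []
    for idx, r in enumerate(responses):
        ordered = [r[j] for j in order]
        score = sum(ordered)
        # a perfect pattern endorses exactly the `score` most popular items
        if ordered != [1] * score + [0] * (k - score):
            flagged.append(idx)
    return flagged


def is_perfect_guttman_scale(responses):
    return not nonscale_respondents(responses)


perfect = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
print(is_perfect_guttman_scale(perfect))
```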

It is easily seen that any measure 
of scalability based on Guttman’s 
criterion must take into account the 
distribution of item marginals, as well 
as the number of items, for the like- 
lihood of approximating perfectly 
nested subsets with any specified 
degree of perfection is much enhanced 
by marginal frequencies which are 
highly dispersed rather than clustered 
near the middle. The coefficient of 
reproducibility is given by

$$\mathrm{Rep} = 1 - p_e, \qquad [1]$$

where $p_e$ is the proportion of the total
number of responses which are errors, 
an error being counted each time a 
respondent endorses a less popular 
item and fails to endorse a more 
popular item. From this it follows
that the maximum number of errors
which may be contributed by a
dichotomous item of popularity $p$ is
$N \min(p, 1 - p)$, where $N$ is the sample
size, and so grows as $p$ approaches .50.
An item with a popularity value of .50
could thus contribute as many as $N/2$
errors. For several dichotomous items,
with popularities $p_i$, as many as
$N \sum_i \min(p_i, 1 - p_i)$ scale errors could occur,
where $\min(p_i, 1 - p_i)$ refers to the
smaller of the two values within
parentheses. The more the $p_i$ differ
from .50, the higher will be the minimum
value of Rep; furthermore, the
mum value of Rep; furthermore, the 
probability of obtaining any given 
degree of reproducibility by chance 
alone will likewise be higher. Gutt- 
man recognized this problem, but his 
solution consisted of rules of thumb 
(Guttman, 1950) which led several 
later writers to express dissatisfaction 
with the coefficient of reproducibility. 
Additional disadvantages of Rep are 
that its general sampling distribution 
is unknown,⁵ it fails to utilize all the
available information, it is not in- 
dependent of the number of items, and 
it relies on inspectional methods which 
allow capitalization on random errors. 
For a recent review of other efforts to
develop satisfactory measures of
reproducibility and scalability, see
White and Saltz (1957).
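The argument about Rep's floor can be put in numbers. Equation 1 gives Rep from an error count, and the per-item maximum of N·min(p, 1 − p) errors implies a lower bound on Rep that depends only on the marginals. This is a sketch of that bound as stated in the text, not a sampling-theory result; the function names are illustrative.

```python
def reproducibility(n_errors, n_respondents, n_items):
    """Equation 1: Rep = 1 - p_e, where p_e = errors / (N * k)."""
    return 1 - n_errors / (n_respondents * n_items)


def rep_floor(popularities):
    """Lower bound on Rep implied by the per-item maximum of
    N * min(p_j, 1 - p_j) errors: p_e <= sum(min(p_j, 1 - p_j)) / k."""
    k = len(popularities)
    return 1 - sum(min(p, 1 - p) for p in popularities) / k


print(rep_floor([0.5, 0.5, 0.5, 0.5]))   # uniform marginals: floor of .50
print(rep_floor([0.1, 0.2, 0.8, 0.9]))   # dispersed marginals: floor near .85
```

By this bound, four items with popularities .1, .2, .8, and .9 cannot yield a Rep much below .85 whatever their content, which is uncomfortably close to the conventional .90 cut-off.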

All of the more satisfactory methods 
of estimating scalability reviewed by 
White and Saltz, as well as some not 
reviewed (Borgatta, 1955; Menzel, 
1953; Willis, 1954), involve adjusting, 
in one way or another, for some of the 
factors which the coefficient of
reproducibility fails to take into account,
most particularly the distribution of
item marginals. An alternative
approach to the problem, and one which
is to be preferred, at least until the 
sampling theory of scale errors has 
been much further developed,⁶ is to use


⁵ Green (1956) has recently provided an
approximate formula for $SE_{\mathrm{Rep}}$, as well as two
easily computed approximate formulae for
Rep, while more recently Goodman (1959)
and Sagi (1959) have contributed to the
sampling theory of the coefficient of
reproducibility.

⁶ And perhaps even after, for reasons to be
discussed. See the example devised by
Loevinger (1947), given below.

the MCM to adjust the item marginals,
rather than adjusting for
them. A particularly important and 
simple case is that of response- 
balancing, in which item marginal 
proportions are set equal to .50. If 
we represent the response-balanced 
case by a diagram such as Fig. 2 and 
Fig. 3, all circles will be of exactly the 
same size and, furthermore, each circle 
will occupy one half the area of the 
rectangle. There is no possibility for 
nesting of sets. The only way in 
which perfect scalability could occur 
is for all circles to coincide exactly, a 
very exacting condition. This would 
occur only if half of the respondents 
endorse all the items and the remain- 
ing half of the respondents endorse 
none of the items. If all items tap the 
same area of content, they can differ 
only in their popularities, which will 
be determined by strength of wording. 
If we remove all differences in pop- 
ularity, each respondent will, except 
for random error, answer each ques- 
tion in the same way. 

Setting all item marginals equal to 
.50 maximizes the opportunity for 
error responses to occur; all spurious 
inflation is removed and the measured 
scalability will assume its minimum 
or basal value. Consequently, the 
scalability observed under conditions 
of response-balancing will be referred 
to as the basal scalability of the set of 
items. As will be shown in a forth- 
coming article, the expected value for 
the proportion of responses which are 
error responses, under conditions of 
response-balancing, is given by

$$E(p_e) = \frac{1}{2}\left[1 - \binom{2x-1}{x}\left(\frac{1}{2}\right)^{2x-1}\right], \qquad [2]$$

where the number of items in the
scale, $k$, is equal to either $2x$ or $2x + 1$.
$E(p_e)$ asymptotically approaches a
limit of one half as $x$ increases without
bound. Fig. 4 shows the expected
value for the coefficient of reproducibility,
$E(1 - p_e)$, as a function of
uniform item popularity and number
of items.

Fig. 4. Expected value of the coefficient
of reproducibility as a function of uniform
item marginals and number of items, $k$.
($k$ equals either $2x$ or $2x + 1$.) (Plot; vertical axis: expected reproducibility.)
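Equation 2 can be checked against a brute-force expectation. The sketch assumes, as one reading of the response-balanced case, that the only perfect patterns are "endorse all" and "endorse none," so a respondent endorsing i of k items contributes min(i, k − i) errors, and that under complete independence the score is Binomial(k, 1/2). Both functions are illustrative.

```python
from math import comb


def expected_error_proportion(k: int) -> float:
    """Closed form of Equation 2, with k = 2x or 2x + 1 (k >= 2)."""
    if k < 2:
        raise ValueError("need at least two items")
    x = k // 2
    return 0.5 * (1 - comb(2 * x - 1, x) / 2 ** (2 * x - 1))


def expected_error_proportion_direct(k: int) -> float:
    """Same quantity by enumeration: score i has probability C(k, i) / 2**k
    and carries min(i, k - i) errors out of k responses."""
    total = sum(comb(k, i) * min(i, k - i) for i in range(k + 1))
    return total / (2 ** k * k)


for k in (2, 3, 4, 5, 10):
    print(k, expected_error_proportion(k), expected_error_proportion_direct(k))
```

The two agree for every k, and the closed form depends only on x, which is why k = 2x and k = 2x + 1 share a value. For k = 10 the expected reproducibility 1 − E(p_e) is about .62.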

There are at least two reasons why 
it is preferable to measure basal 
scalability. First, error responses are 
given every opportunity to appear. 
There is no possibility for them to 
“hide behind” large or small mar- 
ginals. Second, indices of basal scala- 
bility can be directly compared from 
one scale to another, whereas this is 
not possible with indices of scalability 
which are affected by the distribution 
of item marginals. A third reason, 
which might be mentioned, is that 
response-balancing maximizes the 
number of discriminations made 
among respondents by each item. 

Let us now consider the problem of 
devising a suitable descriptive statistic 
for indicating the basal scalability of 
a set of items. White and Saltz
(1957) have proposed four criteria 
against which any index of scalability 
may be evaluated. Their four cri- 
teria, in their words, are: 
1. Does it yield a theoretical maximum
value which is the same for any test?

2. Does it yield a theoretical minimum
value which is the same for any test?

3. Does it permit evaluation of the null
hypothesis that the obtained reproducibility
index is not significantly different
from chance?

4. Does it permit evaluation of each item
in the test as well as of the test as a
whole?


The index to be proposed presently 
is related to Borgatta’s (1955) error 
ratio statistic. It is, therefore, ap- 
propriate to examine this statistic, 
and to evaluate it against the above 
four criteria. Borgatta defines the 
error ratio as “the ratio of errors in 
the scale to the maximum possible 
number of errors for a scale of the 
same marginal frequencies.” By 
“maximum possible errors” is meant
the number of errors generated by 
complete independence among items, 
under the assumption that all items 
are properly ordered with respect to 
item popularities and that all items 
are scored in the direction yielding 
fewest errors. Thus, the error ratio 
is, more precisely, the ratio of the 
number of observed scale errors to the 
expected number of errors under the 
null hypothesis of no item covariation. 
The difference between the expected 
number of errors and the minimum 
possible number will not be large for 
a scale of many items, but for a scale 
of only a few items, the difference will 
be appreciable. 
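The expected-errors denominator of the error ratio can be computed by the brute-force route the author describes in a footnote: enumerate all 2^k patterns, weight each by its probability under complete independence, and multiply by its error count. The exact counting rule is not given here, so this sketch assumes one: a pattern's errors are the fewest responses that must change to reach a perfect nested pattern, with items of equal popularity required to be endorsed together. Under that assumption, response-balanced items charge a respondent endorsing i items min(i, k − i) errors. All names are hypothetical.

```python
from itertools import product


def perfect_cut_points(pops_sorted):
    """Scores s at which a perfect pattern may stop. Tied items must be
    endorsed together, so s can only fall between differing popularities."""
    k = len(pops_sorted)
    return [s for s in range(k + 1)
            if s in (0, k) or pops_sorted[s - 1] != pops_sorted[s]]


def pattern_errors(ordered, cuts):
    """Fewest changed responses needed to reach 'endorse the s most popular items'."""
    return min(sum(r != (1 if pos < s else 0) for pos, r in enumerate(ordered))
               for s in cuts)


def expected_errors_independent(popularities, n_respondents):
    """Sum over all 2**k patterns of probability x errors, assuming independence."""
    pops_sorted = sorted(popularities, reverse=True)   # most popular first
    cuts = perfect_cut_points(pops_sorted)
    total = 0.0
    for pattern in product((0, 1), repeat=len(pops_sorted)):
        prob = 1.0
        for p, r in zip(pops_sorted, pattern):
            prob *= p if r else 1 - p
        total += prob * pattern_errors(pattern, cuts)
    return n_respondents * total


def error_ratio(observed_errors, popularities, n_respondents):
    """Observed scale errors divided by errors expected under independence."""
    return observed_errors / expected_errors_independent(popularities, n_respondents)


print(expected_errors_independent([0.8, 0.2], 100))
print(expected_errors_independent([0.5] * 4, 100))
```

The loop visits 2^k patterns, 1024 for ten items, which illustrates why the general computation becomes unwieldy.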

The error ratio has a theoretical 
maximum value of unity for any test, 
which obtains when items are com- 
pletely independent, so that it meets 
the first of the White and Saltz 
criteria. It could happen, especially 
with a small number of items, that the 
mean interitem covariation would be 
slightly negative (Willis, 1959), in
which case the error ratio could
exceed unity, but this would generally
imply that the items were not keyed
in a consistent manner relative to one 
another. By reversing the direction 
in which some items are keyed, it will 
always be possible to attain a non- 
negative mean interitem covariance. 
Assuming this condition to be met, 
and assuming proper ordering of items 
by popularity, the error ratio will in 
every case possess an upper bound of 
unity. Likewise, the error ratio 
meets the second criterion, for with 
any test perfect scalability yields an 
error ratio of zero. Borgatta does not 
consider the problem of testing for the 
significance of error ratios, nor does 
he present an index for evaluating 
individual items. The error ratio 
statistic, consequently, does not meet 
Criteria 3 and 4. 

A further disadvantage of the error 
ratio may be pointed out. Because 
it allows a range of item marginals, it 
may indicate a high degree of scala- 
bility when little or no valid scalability 
is actually present, as an example 
constructed by Loevinger (1947) illu- 
strates nicely. She imagined the case 
of an ability test consisting of 10 
items, each measuring a different 
ability, and each item, furthermore, 
being at a different grade level of 
difficulty. If the test is given to 10 
students, each being an average 
student, one from each of 10 grades, 
high apparent scalability (“homogeneity”
in her terminology) would
result. Accidental, but systematic,
relationships between item content
and item popularity can also produce
such effects in attitude scales. For
this reason, the error ratio should only 
be used to compare the scalability of 
scales with similar distributions of 
marginals, as Borgatta clearly pointed 
out.⁷ The same restriction applies to the coefficient of reproducibility as well. It is preferable, therefore, to obtain the additional data required to response-balance items, and to employ some index of basal scalability.

7 The error ratio also presents difficulties from a computational standpoint. Except when the number of items is quite small, computing the expected number of scale errors is an enormous task. The procedure involves determining the expected frequency of each possible response pattern under the assumption of complete independence of items. For each such pattern the expected frequency is multiplied by the number of error responses contained in the pattern. The sum of these products yields the expected number of errors. Even for the relatively uncomplicated case of 10 dichotomous items, the number of response patterns equals 1024, and the method becomes unfeasible. As will be shown, major simplifications can be made for the case of dichotomous items with uniform marginals.

What index of basal scalability should be employed? One possibility, of course, is to retain the error ratio, but stipulate that it is to be computed from response-balanced data. A second possibility, which seems to the author to be more satisfactory, is to define a new statistic equal to unity minus the error ratio as computed from response-balanced data. This would possess the advantage of going in the “right” direction, that is, the more scalable the items, the larger the value of the coefficient. (Borgatta preferred to run the error ratio in the other direction, feeling that there would otherwise be the danger of confusion between error ratios and reliability coefficients.) Define the coefficient of basal scalability, $S_B$, as follows:

$$S_B = 1 - \frac{R_e}{E(R_e)}, \qquad [3]$$

where $R_e$ is the observed number of error responses, after the items have been response-balanced; and $E(R_e)$ is the expected value of $R_e$, computed by multiplying the value of $E(p_e)$ found by Equation 2 by the total number of responses, $Nk$. $R_e$ is readily obtained from the observed distribution by means of the following formulas:

$$R_e = \begin{cases} \sum_{i=1}^{x-1} i\,(f_i + f_{k-i}) + x f_x, & k = 2x,\\[4pt] \sum_{i=1}^{x} i\,(f_i + f_{k-i}), & k = 2x + 1, \end{cases} \qquad [4]$$

where $f_i$ is the number of respondents endorsing exactly $i$ items.
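As a rough illustration of Equations 3 and 4, the Python sketch below computes $R_e$, $E(R_e)$, and $S_B$ from a score distribution. It assumes that a response-balanced respondent with score i is charged min(i, k − i) errors, which is what Equation 4 expresses, and it takes $E(R_e)$ from the binomial score distribution expected when the items are completely independent and every marginal is .50. Equation 2 is not reproduced here, so that expectation is an assumption. The function names and the score distribution are hypothetical.

```python
from math import comb


def observed_errors(f):
    """Equation 4: R_e from the observed score distribution.

    f[i] is the number of respondents endorsing exactly i of the k
    response-balanced items, so len(f) == k + 1.  Charging a respondent
    with score i exactly min(i, k - i) errors reproduces both the k = 2x
    and the k = 2x + 1 forms of Equation 4.
    """
    k = len(f) - 1
    return sum(min(i, k - i) * f_i for i, f_i in enumerate(f))


def expected_errors(k, n):
    """Assumed E(R_e): under complete independence of k items, each with
    marginal .50, scores follow Binomial(k, 1/2)."""
    return n * sum(comb(k, i) * min(i, k - i) for i in range(k + 1)) / 2 ** k


def basal_scalability(f):
    """Equation 3: S_B = 1 - R_e / E(R_e)."""
    k, n = len(f) - 1, sum(f)
    return 1 - observed_errors(f) / expected_errors(k, n)


# Hypothetical data: 100 respondents, k = 5 items, scores piled at the extremes.
scores = [30, 10, 10, 10, 10, 30]
print(observed_errors(scores))              # 60
print(expected_errors(5, 100))              # 156.25
print(round(basal_scalability(scores), 3))  # 0.616
```

Under these assumptions `expected_errors(2, n)` returns n/2, and the binomial shortcut replaces the pattern-by-pattern enumeration described in footnote 7.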

It is readily seen that $S_B$ satisfies
the first two criteria of White and
Saltz, for in all cases its upper bound
is unity and its lower bound is zero,⁸
corresponding to perfect scalability 
and the complete absence of scala- 
bility, respectively. Criterion 3 is 
also met, for the score distribution 
observed after response-balancing 
may be tested against the theoretical 
binomial distribution of scores which 
would be obtained with complete 
independence of items. The greater 
the scalability, the more platykurtic 
will be the observed distribution of 
scores; extremely high scalability 
would result in a U shaped distribu- 
tion, with the majority of scores piling 
up at the two extremes. Either a chi 
square test of goodness-of-fit, or the 
Kolmogorov-Smirnov test can be used 
as the test of significance, the latter 
being preferable because of its pre- 
sumably greater power (Siegel, 1956). 
The same tests may also be used to 
test the significance of the difference 
in scalability between two scales 
composed of an equal number of items. 
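The goodness-of-fit comparison can be sketched in the same way. The Python functions below (hypothetical names) compute a Pearson chi square and the Kolmogorov-Smirnov distance D between the observed score distribution and the binomial distribution with p = .50; they stop short of p values, which would be read from the usual tables. When the expected frequencies in the extreme score categories are small, as in this example, the chi square approximation should be treated with caution.

```python
from math import comb


def binomial_expected(k, n):
    """Expected score frequencies under complete independence of k items,
    each with marginal .50: n * C(k, i) / 2**k for i = 0..k."""
    return [n * comb(k, i) / 2 ** k for i in range(k + 1)]


def chi_square_vs_binomial(f):
    """Pearson chi square of observed scores against the binomial.
    df = k, since no parameter is estimated from the data."""
    k, n = len(f) - 1, sum(f)
    expected = binomial_expected(k, n)
    stat = sum((o - e) ** 2 / e for o, e in zip(f, expected))
    return stat, k


def ks_distance_vs_binomial(f):
    """Largest absolute gap between observed and binomial cumulative proportions."""
    k, n = len(f) - 1, sum(f)
    expected = binomial_expected(k, n)
    obs_cum = exp_cum = d = 0.0
    for o, e in zip(f, expected):
        obs_cum += o / n
        exp_cum += e / n
        d = max(d, abs(obs_cum - exp_cum))
    return d


scores = [30, 10, 10, 10, 10, 30]
stat, df = chi_square_vs_binomial(scores)
print(round(stat, 2), df)                         # 495.2 5
print(round(ks_distance_vs_binomial(scores), 5))  # 0.26875
```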

8 The same assumptions must be made to assure a lower bound of zero for $S_B$ as were required to establish an upper bound of unity for the error ratio.

Guttman (1944, 1950) recommended a cut-off point of .90 for his coefficient of reproducibility, separating scales from “quasi-scales” and nonscales. It has been customary for workers in this area to abide by this dichotomization, arbitrary as it is, or to substitute one of their own, e.g., Menzel’s (1953) cut-off point of .60-.65 for his coefficient of scalability. Although the value of such arbitrary cut-off values seems rather limited, we may establish such a value for $S_B$, perhaps a less arbitrary one, by requiring that the number of observed errors, $R_e$, be no greater than half the expected number, $E(R_e)$. This condition implies $S_B \ge .50$. Despite its numerically lower value, this cut-off point represents a higher degree of scalability than either of the above-mentioned cut-off points, and will probably not be attained very often in practice. It seems worthwhile, therefore, to establish a second, less stringent condition, such as $3R_e \le 2E(R_e)$, which implies $S_B \ge .33$. The $S_B$ values of .33 and .50 may be referred to, respectively, as the first level and the second level of scalability.
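Both levels follow from Equation 3 by dividing each condition through by $E(R_e)$, which is positive:

$$\begin{aligned}
R_e \le \tfrac{1}{2}E(R_e) &\iff \frac{R_e}{E(R_e)} \le \tfrac{1}{2} \iff S_B = 1 - \frac{R_e}{E(R_e)} \ge \tfrac{1}{2},\\
3R_e \le 2E(R_e) &\iff \frac{R_e}{E(R_e)} \le \tfrac{2}{3} \iff S_B \ge \tfrac{1}{3} \approx .33.
\end{aligned}$$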


Item-to-scale Internal Consistency 


White and Saltz’ last criterion may be met by developing appropriate item-to-scale indices. We shall assume that the items in the scale have been response-balanced, as has the item which is to be evaluated against the scale. Consider Table 1, which shows scale scores plotted against item scores (+ or −).

TABLE 1

TABLE OF FREQUENCIES REQUIRED FOR COMPUTING A AND A’

                      Item Score
Scale Score       Fail          Pass
    k             f_k'          f_k
    ⋮               ⋮             ⋮
    i             f_i'          f_i
    ⋮               ⋮             ⋮
    1             f_1'          f_1
    0             f_0'          f_0
  Total           N/2           N/2

Each $f_i$ indicates the number of respondents with score $i$ on the scale that endorse the item in question, while $f_i'$ indicates the corresponding number of respondents that did not endorse the item. If the scale possesses perfect scalability, $f_k + f_k' = N/2$ and $f_0 + f_0' = N/2$. If, in addition, the item classifies every respondent in agreement with the scale, $f_k'$ and $f_0$ will both be zero, so that $f_0' + f_k = N$. If the item does not classify all respondents in accordance with the results from the perfect scale, the ratio $(f_0' + f_k)/N$ will give the proportion of respondents on which there is item-to-scale agreement. Under conditions of imperfect scalability, an appropriate measure of item-to-scale agreement is the number of respondents passing the item that are above the scale median, plus the number of respondents failing the item that are below the scale median, all divided by N. Formally,


$$A = \frac{\sum_{i=0}^{\mathrm{Mdn}} f_i' + \sum_{i=\mathrm{Mdn}}^{k} f_i}{N}, \qquad [5]$$

where A is the item-to-scale coefficient of agreement, $\sum_{i=0}^{\mathrm{Mdn}} f_i'$ is the number of respondents failing the item that are below the scale median, and $\sum_{i=\mathrm{Mdn}}^{k} f_i$ is the number of respondents passing the item that are above the scale median. The coefficient A equals the proportion of respondents on which scale and item agree.

Before A can be computed, it is
necessary to reduce Table 1 to a
2 × 2 table. The problem here is to
determine cell values in a way
that preserves response-balancing and
at the same time accurately reflects
the degree of association. The fol- 
lowing relationships, which must hold 
when both item and scale are response- 
balanced, may be employed for 
achieving this end: (a) not only must 
the plus and minus item-responses be 
equal, but the same must be true for 
the plus and minus scale-responses, 
and (b) the 2 × 2 table must be
symmetrical—the upper left cell fre- 
quency must equal the lower right 
cell frequency, and the upper right 
cell frequency must equal the lower 
left cell frequency. Because it will 
often happen that the cases falling at 
the scale score category within which 
the median-cut must be made can be 
assigned in more than one way so as 
to meet the above two conditions, a 
supplementary assignment criterion 
is needed. An obvious and logical 
one is to require that the sum of the 
cases assigned to the upper left and 
lower right cells be as nearly as pos- 
sible in the same ratio to the sum of 
the cases assigned to the upper right 
and lower left cells as the correspond- 
ing sums of cases already assigned to 
the two diagonals. It is not as com- 
plicated as it may sound, and is easily 
done in practice. 
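Once Table 1 has been collapsed to a 2 × 2 table, Equation 5 is a simple count. The Python sketch below checks conditions (a) and (b) before computing A. The argument names (which half of the scale, and fail or pass on the item) are illustrative, and the median category is assumed to have been split already by the assignment rule just described.

```python
def item_scale_agreement(above_fail, above_pass, below_fail, below_pass):
    """Equation 5 on a reduced 2 x 2 table: A = (below_fail + above_pass) / N.

    Rows follow Table 1 (higher scale scores on top); columns are Fail, Pass.
    """
    n = above_fail + above_pass + below_fail + below_pass
    # Condition (a): both the scale and the item are split N/2 : N/2.
    if 2 * (above_fail + above_pass) != n or 2 * (above_fail + below_fail) != n:
        raise ValueError("item and scale must both be response-balanced")
    # Condition (b): upper left == lower right and upper right == lower left.
    if above_fail != below_pass or above_pass != below_fail:
        raise ValueError("the 2 x 2 table must be symmetrical")
    return (below_fail + above_pass) / n


# Hypothetical reduced table for 100 respondents.
print(item_scale_agreement(above_fail=7, above_pass=43, below_fail=43, below_pass=7))  # 0.86
```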

Because A gives equal weight to each case, and because it only distinguishes between cases above and below the scale median, it is possible even for an imperfect scale to show perfect agreement with an item. By introducing a simple differential weighting system, a coefficient of item-to-scale agreement may be devised which takes into account the degree of scalability of the scale. Let $f_0'$ and $f_k$ be assigned weight k, let $f_1'$ and $f_{k-1}$ be assigned weight k − 1, etc., down to $f_x'$ and $f_{k-x}$, which are assigned weights k − x, where k = 2x or k = 2x + 1. The modified coefficient of agreement is then given by

$$A' = \frac{\sum_{i=0}^{\mathrm{Mdn}} (k - i)\, f_i' + \sum_{i=\mathrm{Mdn}}^{k} i\, f_i}{kN}, \qquad [6]$$


the notation being the same as that 
used previously. Except in the limit- 
ing case of perfect scalability, A > A’, 
for A’ is attenuated by the presence 
of scores other than zero or k, whereas 
A is not affected by the distribution 
of scores, except insofar as this de- 
termines which cases are above and 
which cases are below the median. 
One may think of A as A’ after 
correction for attenuation. 
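The weighting scheme of Equation 6 can be sketched directly on the two columns of Table 1, with no median split required. In the Python sketch below (hypothetical function name and data), $f_i'$ for i ≤ x receives weight k − i and $f_i$ for i ≥ k − x receives weight i, with x = k // 2, which covers both k = 2x and k = 2x + 1.

```python
def weighted_agreement(f_fail, f_pass):
    """Equation 6 (A') from the Fail and Pass columns of Table 1.

    f_fail[i] = f_i' and f_pass[i] = f_i for scale scores i = 0..k.
    """
    k = len(f_pass) - 1
    x = k // 2
    n = sum(f_fail) + sum(f_pass)
    num = sum((k - i) * f_fail[i] for i in range(x + 1))    # weights k .. k - x
    num += sum(i * f_pass[i] for i in range(k - x, k + 1))  # weights k - x .. k
    return num / (k * n)


# Hypothetical k = 2 table: 100 respondents, item split 50 : 50.
print(weighted_agreement(f_fail=[30, 15, 5], f_pass=[5, 15, 30]))  # 0.75
```

Working the assignment rule by hand on the same hypothetical data splits the 30 median-category cases so that the 2 × 2 cells are 7, 43, 43, 7, giving A = .86, which exceeds A′ = .75 as the attenuation argument leads one to expect.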

The coefficient A can also be com- 
puted on a scale-to-scale basis by 
dichotomizing both scales at the 
median, and the same is true for A’. 
The computation of scale-to-scale 
values of A’ would involve a double 
usage of the differential weighting 
system, so that each cell frequency 
would be multiplied by two weights, 
one for each of the two scale scores. 


Item-to-Item Internal Consistency under Conditions of Response-balancing


An appropriate item-to-item index of basal scalability can be obtained by setting k = 2 in the formula for $S_B$. From Equation 2, we find that $E(R_e) = N/2$. From Equation 3 it follows that

$$S_B(2) = 1 - 2R_e/N, \qquad [7]$$

where $S_B(2)$ is the basal scalability for a pair of items, and $R_e$ is the number of error responses occurring in the 2 × 2 table after the items have been response-balanced. $R_e$ is the number of respondents that endorse exactly one of the two items. $S_B(2)$ ranges from +1 to −1, and assumes a value of zero when items are completely independent.

$S_B(2)$ is very closely related to Loevinger’s (1947, 1948) item-to-item coefficient of homogeneity, $H_{ij}$, the only difference being that items are not response-balanced prior to computation of $H_{ij}$. In computing $H_{ij}$, only those cases in one cell of the 2 × 2 table are considered to be discrepant, viz., those individuals who pass the harder (endorse the less popular) item and fail the easier (do not endorse the more popular) item. For items with tied item marginals, one would presumably continue to count but one cell value as K, the number of discrepant cases. If both item marginals are .50, $H_{ij}$ reduces to $1 - 4K/N$. Furthermore, K will be equal to $R_e/2$, since $R_e$ includes both of the equal-sized cell frequencies which are discrepant. Thus, if $H_{ij}$ is computed on response-balanced items, $H_{ij} = S_B(2)$. Loevinger (1948) has shown that $H_{ij}$ is identically equal to the ratio of φ to the maximum possible value of φ for the given item marginals. Since this maximum value of φ is unity for equal item marginals, $S_B(2)$ is equal to φ.
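The chain of identities above can be checked numerically. For a response-balanced, symmetrical 2 × 2 table with equal agreement cells a and equal disagreement cells b (so a + b = N/2), φ = (a² − b²)/(N/2)² = (a − b)/(N/2) = 1 − 4b/N, which is $1 - 2R_e/N$ with $R_e = 2b$. The Python sketch below, with hypothetical cell counts and function name, computes all three quantities.

```python
from math import sqrt


def pair_indices(both, only_first, only_second, neither):
    """S_B(2), H_ij and phi for two response-balanced items.

    Cells: respondents endorsing both items, only the first, only the
    second, or neither.  With both marginals at .50 the two discrepant
    cells are equal, so either one can serve as K.
    """
    n = both + only_first + only_second + neither
    r_e = only_first + only_second        # respondents endorsing exactly one item
    s_b2 = 1 - 2 * r_e / n                # Equation 7
    h_ij = 1 - 4 * only_first / n         # H_ij = 1 - 4K/N at marginals of .50
    phi = (both * neither - only_first * only_second) / sqrt(
        (both + only_first) * (only_second + neither)
        * (both + only_second) * (only_first + neither)
    )
    return s_b2, h_ij, phi


print(pair_indices(40, 10, 10, 40))  # all three equal .6, up to rounding
```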


THE METHOD OF CONTROLLED
RESPONDENT MARGINALS 


It is also possible to make adjustments with respect to respondents rather than items. This “transpose” of the MCM, which employs the same kind of data but which analyzes it differently, has been named the Method of Controlled Respondent Marginals (the MCRM).⁹ ¹⁰ Normally, the purely qualitative patterning of responses given by one respondent cannot be directly compared with that given by another, unless both happen to have very nearly the same score, for the response pattern of a respondent is a function of his total score. It seems likely that the most frequent use to which the MCRM will be put will be to eliminate differences in total scores, so that the purely qualitative differences in response patterns may be analyzed.

9 The possibility was considered of naming the technique the Method of Controlled Scores, but this was rejected in favor of the name given, one reason being that the term score carries a strong connotation of measurement of quantitative individual differences, making the term seem inappropriate in the absence of variation about the mean.

10 Special problems are encountered with the MCRM whenever the number of items in the scale is odd. Because the added complexities in no way change the basic principles, it will be assumed that k = 2x throughout the discussion of the MCRM.

If respondents agree with one another exactly in the ranking of items with respect to strength of agreement, but display individual differences in means, variances, and exact shapes of their response distributions, then setting all respondents’ marginals equal will result in identical response patterns for all respondents. All respondents would be scored the same way on each item. Furthermore, just as response-balancing items by means of the MCM allowed a basal measure of scalability to be computed for the set of items, response-balancing respondents by means of the MCRM allows one to observe the basal similarity existing among the group of respondents. That component of similarity due to respondents’ tendencies to agree or disagree with items in general has been removed. If respondents on the average endorse 90% of the items before response-balancing, and particularly if each respondent endorses exactly 90% of the items, response patterns would necessarily show a great deal of similarity. The MCRM allows these effects to be eliminated, just as the MCM allows the elimination of spurious scalability due to nonuniform item marginals. Minimizing similarity of response patterns is, of course, equivalent to maximizing qualitative interrespondent differences.
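A minimal sketch of response-balancing by respondents follows, assuming each respondent’s graphic-scale ratings are available as numbers: the k/2 items a respondent rates most favorably are scored plus (1) and the rest minus (0). The ratings, the tie-breaking by item order, and the function names are assumptions for illustration. The example shows two respondents who rank the items identically but differ in level ending up with identical patterns.

```python
def balance_respondent(ratings):
    """Response-balance one respondent: 1 for the k/2 most favorably rated
    items, 0 for the rest.  Ties are broken by item order."""
    k = len(ratings)
    if k % 2:
        raise ValueError("this sketch assumes an even number of items, k = 2x")
    # sorted() is stable, and stays stable with reverse=True
    ranked = sorted(range(k), key=lambda j: ratings[j], reverse=True)
    plus = set(ranked[: k // 2])
    return [1 if j in plus else 0 for j in range(k)]


def item_marginals(patterns):
    """Number of respondents endorsing each item."""
    return [sum(column) for column in zip(*patterns)]


# Two hypothetical respondents: same ranking of four items, different levels.
lenient = balance_respondent([95, 90, 80, 70])
severe = balance_respondent([40, 30, 20, 5])
print(lenient, severe)                    # [1, 1, 0, 0] [1, 1, 0, 0]
print(item_marginals([lenient, severe]))  # [2, 2, 0, 0]
```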

If patterns of endorsement are 
identical for all respondents, after 
differences in level of endorsement are 
removed, half of the items will be 
unanimously endorsed while the re- 
maining half will be endorsed by no 
respondents. At the opposite ex- 
treme, with maximum dissimilarity 
among response patterns, no item 
would be endorsed more frequently 
than any other. That is, all items
would have marginals of .50. A 
sufficient but not necessary condition 
for obtaining this minimum degree of 
pattern similarity is that of complete 
independence among respondents, so 
that all possible patterns become 
equally likely. In all cases, no matter 
what the degree of pattern similarity, 
the sum of the marginal frequencies 
will equal Nk/2.

The fact that response-balancing by 
respondents allows one to obtain an 
indication of the basal similarity or 
consensus among respondents, just as 
response-balancing by items allows 
measurement of basal scalability of 
items, suggests that a measure
somewhat analogous to $S_B$ can be devised
for the measurement of basal con- 
sensus. Consider the fact that in- 
dividual item marginal frequencies 
must equal N/2, on the average, and 
that the degree of consensus is re- 
flected by deviations of the marginal 
frequencies about this mean. Thus, 
although the algebraic sum of devia- 
tions about the mean must equal zero, 
the sum of the absolute magnitudes 
of the deviations will be proportional 
to the degree of consensus present. 
All that is lacking is the proper scale, 
and this can be achieved by dividing 
by the maximum possible value. If half the items have marginal frequencies of N, and half have marginal frequencies of zero, each will deviate from the mean of N/2 by the amount N/2. The maximum possible sum of absolute deviations about the mean is therefore equal to k · N/2 = Nx. Thus,

$$C_B = \frac{\sum_{i=1}^{k} \lvert f_i - N/2 \rvert}{Nx}, \qquad [8]$$

where $C_B$ is the coefficient of basal consensus among respondents, and the $f_i$ are the item marginal frequencies. $C_B$ ranges between zero and unity.
It is a measure of the qualitative, or 
configural, similarity among respond- 
ents which remains after that com- 
ponent of similarity has been removed 
which is due to respondents’ tend- 
encies to agree or disagree with items 
in general. 
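Equation 8 translates directly into code. The Python sketch below (hypothetical function name and marginals) takes the k item marginal frequencies of N response-balanced respondents; it assumes k = 2x, as in the rest of this section.

```python
def basal_consensus(marginals, n):
    """Equation 8: C_B = sum |f_i - N/2| / (N x), with k = 2x items."""
    k = len(marginals)
    if k % 2:
        raise ValueError("assumes an even number of items, k = 2x")
    x = k // 2
    return sum(abs(f - n / 2) for f in marginals) / (n * x)


# N = 10 respondents, k = 4 items; each set of marginals sums to Nk/2 = 20.
print(basal_consensus([10, 10, 0, 0], n=10))  # 1.0: complete consensus
print(basal_consensus([5, 5, 5, 5], n=10))    # 0.0: no item preferred
print(basal_consensus([8, 7, 3, 2], n=10))    # 0.5
```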

A second possible measure of basal 
consensus is the ratio of the variance 
of item marginals to the maximum 
possible variance of item marginals: 


$$H_R = V/V_{\max}, \qquad [9]$$

where $H_R$ is the coefficient of respondent homogeneity. This index is related by its underlying logic (but not by its intended function) to Loevinger’s coefficient of test homogeneity, $H_t$.
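A sketch of Equation 9 follows. It assumes that V is the population variance of the k item marginals; because the marginals lie between 0 and N and average N/2, that variance is largest when half the items have marginal N and half have 0, which gives $V_{\max} = (N/2)^2$. The function name and data are hypothetical.

```python
from statistics import pvariance


def respondent_homogeneity(marginals, n):
    """Equation 9: H_R = V / V_max, with V the population variance of the
    item marginals and V_max = (N/2)**2 (assumed definitions)."""
    return pvariance(marginals) / (n / 2) ** 2


print(respondent_homogeneity([10, 10, 0, 0], n=10))  # 1.0
print(respondent_homogeneity([8, 7, 3, 2], n=10))    # 0.26
```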

Whatever descriptive statistic may 
be employed, analysis of variance may 
be used to test the significance of the 
deviations of item marginal fre- 
quencies about their mean, and thus 
the degree of basal consensus. The 
fact that scores are exclusively 1’s and 
0’s (corresponding to plus and minus 
responses, respectively) simplifies 
computations enormously. In light 
of the known “robustness” of anova 
against violations of its assumptions 
(Norton, as reported in Lindquist, 
1953, pp. 78-90) one need not worry 
excessively over the violation of the 
assumption of normality, except in 
cases of borderline significance.

There are, however, two special
features about this use of analysis of 
variance. First, constraining the re- 
sponse pattern of each respondent to 
contain exactly k/2 pluses (or 1's) 
reduces the degrees of freedom per 
respondent from k to k/2, for deter- 
mining the k/2 items evoking plus 
responses also determines the responses evoked by the k/2 remaining items. Thus, with k items and N respondents, the total number of degrees of freedom will be ½Nk − 1, and the F ratio will have k − 1 and ½Nk − k degrees of freedom. A
second peculiarity is that this seems 
to be one place in which a two-tailed 
test of significance is appropriate with 
analysis of variance, for an F ratio 
less than one is not meaningless. If 
1/F were significantly larger than one,
this would indicate a significant degree 
of basal disagreement among respondents. $H_R$ can also be tested for
significance by an F test given by 
Jardine (1958). (As used by Jardine, 
this test indicates the significance of 
the scalability, or something closely 
akin, of an instrument.) 
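The analysis of variance can be sketched as follows, on our reading that the item effect is tested in a one-way layout of the N × k table of 1’s and 0’s, using the degrees of freedom stated above. Because every balanced respondent has exactly k/2 ones, all respondent means equal .50 and respondents contribute nothing to the total sum of squares, so the item and within sums of squares depend only on the item marginals. The function name is hypothetical, and p values (including the two-tailed case via 1/F) would still be read from F tables.

```python
def consensus_f_ratio(marginals, n):
    """F ratio for differences among item marginals of an N x k table of
    0/1 scores in which every respondent has exactly k/2 ones.

    Degrees of freedom follow the text: k - 1 and Nk/2 - k.
    """
    k = len(marginals)
    ss_total = n * k / 4                                     # every cell deviates by 1/2 from the grand mean
    ss_items = sum((f - n / 2) ** 2 for f in marginals) / n  # N * sum (p_j - 1/2)**2
    ss_within = ss_total - ss_items
    df_items = k - 1
    df_within = n * k // 2 - k
    if ss_within <= 0:                                       # complete consensus: no within variation
        return float("inf"), df_items, df_within
    f_ratio = (ss_items / df_items) / (ss_within / df_within)
    return f_ratio, df_items, df_within


print(consensus_f_ratio([8, 7, 3, 2], n=10))  # F of about 1.87 on 3 and 16 df
```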

$C_B$ or $H_R$ provides an over-all
indication of the configural similarity 
of the group. Additional useful in- 
formation can be obtained by com- 
puting the frequency with which each 
pattern occurs, or at least doing this 
for the most frequently occurring 
patterns. To show what might hap- 
pen in an extreme case, assume all 
item marginal frequencies were observed to be exactly N/2. If N were some multiple of $\binom{k}{k/2}$, which is the number of possible response patterns, the rectangular distribution of marginals could have been produced by the occurrence of all patterns equally often. On the other hand (and here it is only necessary that N be even)
the rectangularity of the marginals 
could be due to the division of the
sample into two equal sized subgroups, each with its own pattern of
responding which is in direct opposi- 
tion to that of the other. This is the 
case of complete consensus within 
subgroups and complete disagreement 
between subgroups. The extent to 
which disagreement in a group consists 
of opposing components of subgroup 
consensus, as opposed to the equally 
often occurrence of all response pat- 
terns, may be referred to as opinion 
or attitude polarization. 

Next, consider the problem of 
assessing the degree of configural 
consensus between pairs of respond- 
ents. With dichotomous items, 
there are generally $2^k$ possible patterns of responses a respondent can make. Under the constraint of response-balancing by respondents, this number is reduced to $\binom{k}{k/2}$. Any pair of response patterns can be made to
generate a congruence pattern. This 
is done by using a plus to indicate 
agreement on a particular item, and a 
minus to indicate disagreement. For 
example, if Respondent A’s response 
pattern on a 10 item scale were 
and 
Respondent B’s response pattern were 
(—-+-—++ — + — +), the corresponding congruence pattern would be (+ + + + + + − + + −). Congruence patterns can range from all pluses to all minuses, but the number of pluses (and minuses) will always be even. The total number of distinct congruence patterns which can arise is equal to $2^{k-1}$.
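The counting claims above can be verified by brute force for small k. The Python sketch below (hypothetical names; patterns written as strings of "+" and "-") enumerates all response-balanced patterns, forms every congruence pattern, and confirms that there are $2^{k-1}$ distinct ones, each with an even number of pluses.

```python
from itertools import combinations


def balanced_patterns(k):
    """All response-balanced patterns of k items (k/2 pluses), as strings."""
    patterns = []
    for plus in combinations(range(k), k // 2):
        patterns.append("".join("+" if j in plus else "-" for j in range(k)))
    return patterns


def congruence(a, b):
    """'+' where two respondents agree on an item, '-' where they disagree."""
    return "".join("+" if x == y else "-" for x, y in zip(a, b))


k = 4
found = {congruence(a, b) for a in balanced_patterns(k) for b in balanced_patterns(k)}
print(len(balanced_patterns(k)), len(found))      # 6 8
print(all(c.count("+") % 2 == 0 for c in found))  # True
```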

Several simple and satisfactory descriptive indices of pattern congruence can be listed, such as


1. The proportion of items on which the two respondents agree.

2. The proportion of agreements above chance, the level of chance agreement being 50%.

3. The phi coefficient between the two response patterns. Using Guilford’s (1950, p. 341) simplified formula, which simplifies even further, since both splits are even, we have

$$\phi_{AB} = P_+ - P_-, \qquad [10]$$

where $P_+$ is the proportion of items on which the two respondents agree, and $P_-$ is the proportion on which they disagree.
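The three indices can be computed from a congruence pattern alone. In the Python sketch below (hypothetical names), index 2 is read as $P_+ - .50$, which is our interpretation, and the chi square uses the standard 2 × 2 identity χ² = Nφ², with the k items playing the role of N.

```python
def congruence_indices(c):
    """Indices 1-3 for a congruence pattern c (a string of '+' and '-')."""
    k = len(c)
    p_plus = c.count("+") / k
    p_minus = c.count("-") / k
    phi = p_plus - p_minus                  # Equation 10
    return {
        "agreement": p_plus,                # index 1
        "above_chance": p_plus - 0.5,       # index 2, read as P+ - .50
        "phi": phi,                         # index 3
        "chi_square": k * phi ** 2,         # chi square = N * phi**2, N = k items
    }


print(congruence_indices("++++++-++-"))
```

For the congruence pattern of the worked example, this gives φ = .60 and χ² = 3.6, up to floating-point rounding.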

The simplest way to test for the significance of the congruence is to compute the chi square value associated with the 2 × 2 table which gave rise to $\phi_{AB}$. An alternate method is by means of the binomial distribution and the fact that, for 2y pluses, there are $\binom{k}{2y}$ equally likely congruence patterns. For the example given above, χ² = 3.6, df = 1, which falls short of significance at the .05 level. This is confirmed by the binomial test. Of the 2⁹ = 512 possible congruence patterns, 45 contain 8 pluses, and one contains 10 pluses. The proportion 46/512 equals .09, indicating that the probability of obtaining at least 8 agreements between two response patterns by chance alone is about 9 times out of a hundred. Thus, for as few as 10 items, only pairs of respondents who agree unanimously can be said to exhibit a significant degree of congruence in their response patterns. Scales with larger numbers of items are required for a sensitive test of significance.
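The binomial test can be sketched in a few lines, using the same counting as the text: $\binom{k}{2y}$ patterns with 2y pluses out of $2^{k-1}$ possible congruence patterns. The function name is hypothetical.

```python
from math import comb


def congruence_tail_probability(k, agreements):
    """Chance probability of at least `agreements` pluses in a congruence
    pattern of k items, counting C(k, 2y) patterns with 2y pluses among
    the 2**(k - 1) possible patterns."""
    favourable = sum(comb(k, m) for m in range(agreements, k + 1) if m % 2 == 0)
    return favourable / 2 ** (k - 1)


print(congruence_tail_probability(10, 8))   # 0.08984375 (46/512)
print(congruence_tail_probability(10, 10))  # 0.001953125 (1/512)
```

Under the same counting, 16 or more agreements among 20 items has a probability of about .01, which illustrates why longer scales give a more sensitive test.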

It is anticipated that the MCRM 
will find application in clinical psy- 
chology, where it would contribute to 
profile analysis (Cattell, 1949; Cron- 
bach & Gleser, 1953; Gaier & Lee, 
1953) and Q-technique (Stephenson, 
1953) ; and in social psychology where 
it could be used to analyze both 
intragroup and intergroup similarities 
(McQuitty, 1956). It should also be
applicable to selection and placement
procedures. In the terminology of 
profile analysis, the effect of the 
MCRM is to remove individual differ- 
ences in elevation. As Cronbach and 
Gleser point out, sometimes elevation 
differences are important (as in apti- 
tude testing) and sometimes such 
differences merely reflect the operation 
of individual differences in response 
sets, and thus should be eliminated. 
It is therefore suggested that, in the 
first case, the MCRM be used to 
supplement techniques which take 
into account differences in elevation, 
while in the second case, it should 
prove capable of standing on its own 
feet. 

Parenthetically, if an investigator 
wishes to analyze his data by the 
MCRM only, without making use of 
the MCM, the data-collection pro- 
cedures can be suitably modified to 
simplify the task. Respondents can 
be requested to indicate directly which 
k/2 items they find most acceptable, 
possibly by sorting statements which 
have been typed onto index cards into 
two equal sized piles. And, of course, 
the method can be modified to require 
respondents to split the set of state- 
ments at some other point, or to make 
more than one split. 


IS IT LEGITIMATE TO ADJUST
ITEM MARGINALS?


One possible objection which might 
be raised against the approach taken 
in this paper is that it is not legitimate 
to alter the popularity or difficulty 
levels of items. In evaluating the 
merits of such an objection, it is 
important to bear in mind that item 
marginals are merely arbitrary joint 
functions of item wording, the dis- 
tribution of the measured trait in the 
population, the representativeness of 
the sample, response sets, perceived 
saliency, past experience of respondents, and several other sources of
variation which could be named. To
give a simple illustration, the marginal 
frequencies of items on a scale of 
hostility towards one’s parents would 
undoubtedly be changed considerably 
if skillfully worded reminders were 
added to the instructions to the effect 
that one should think nothing bad 
about one’s parents. The probability 
that a particular individual will
endorse an item would be a joint 
function of his hostility towards his 
parents and his susceptibility to the 
persuasion. Clearly, the unadjusted 
marginal of an item does not locate a 
nonarbitrary point on the scale, even 
if all items have highly similar con- 
tent. Nor does it characterize the 
item itself in any sense, except with 
respect to a host of external factors.
Even more untenable is the assump- 
tion that the popularity of an item 
establishes a nonarbitrary zero point 
so that one can say that respondents 
agreeing with the item have positive 
feelings about the issue and those 
disagreeing with the item have nega- 
tive feelings. It is relevant here to 
note that studies employing the 
intensity analysis technique of Gutt- 
man (1947) and Suchman (1950) have 
shown that the proportion of respond- 
ents holding a favorable attitude 
towards an issue may be either larger 
or smaller than the proportion in- 
dicating agreement with the state- 
ment, depending upon the degree of 
severity with which the statement is 
worded. If some purpose should be 
served, the MCM technique can be 
used to equate these two proportions 
so that a positively scored response 
would imply positive affect and a 
negatively scored response would 
imply negative affect. 

Relative item marginal proportions 
are affected by fewer extraneous 
factors than the absolute proportions, 
and so are less arbitrary, since several 
possible factors might raise or lower
the mean of a set of marginals without
disturbing their rank order. For this 
reason, one will usually wish to pre- 
serve the original ranking of items 
when adjusting marginals, to the ex- 
tent that this is allowed by the 
specifications for the desired distribu- 
tion of marginals. 

There is also considerable arbitrari- 
ness in the change in the marginal of a 
single item administered on two occa- 
sions. Such a change will, of course, 
be completely arbitrary unless the 
sample is substantially the same on 
both occasions, but even if this condi- 
tion is met, it is not generally possible 
to determine the relative effects of 
such factors as true changes in the 
distribution of the trait in the sample, 
memory effects due to answering the 
items previously, changes in response 
sets, changes in conditions of adminis- 
tration, etc. Because of this, it will 
be quite legitimate to choose one 
cutting point for an item on one 
occasion and to choose a different 
cutting point for the same item on 
another occasion, as one might be 
required to do if, for example, it was 
desired to hold the marginal propor- 
tion constant from one occasion to the 
second. This is not to say that 
observed changes in marginal propor- 
tions are without meaning and should, 
therefore, be ignored. Depending 
upon the assumptions which seem 
justifiable in a given situation, such 
changes in item marginals may have 
clear-cut and important meaning. 
This is to say, however, that under
the usual conditions of data collection,
observed shifts in item marginals will
have considerably less than perfect
interpretability, and that a useful
purpose may often be served by hold- 
ing the marginal of an item constant 
over occasions, or by maintaining the 
same distribution of marginals of a 
set of items from one occasion to 
another. The possibility is not ruled 
out that it may sometimes be advantageous to consider the data both
in its unadjusted form and in one or 
more adjusted forms. 

One of the best arguments favoring 
adjustment of marginals, furthermore, 
is the fact that the method has been 
found to be highly satisfactory in 
practice. Fifty multiple-response 
items of the type previously described 
were administered to a sample of 100 
University of Wisconsin undergrad- 
uates. About a third of the group 
took the questionnaire again four days 
later, for the purpose of ascertaining 
the test-retest reliability. This was 
found to be equally high after re- 
sponse-balancing as before. At no 
time did a respondent indicate that 
the task was difficult or unpleasant. 
In fact, the impression received was 
that the addition of the graphic scales 
made the task easier and more mean- 
ingful for most respondents, as it 
allowed greater flexibility in respond- 
ing. In analyzing the data, it was 
found that the computations required 
to response-balance the data were not 
especially time-consuming, and were, 
in addition, of the sort which can be 
performed by unskilled clerical help. 
The detailed empirical findings will be 
reported in a subsequent article. 


SUMMARY 


A general and flexible method was 
presented for the manipulation of the 
marginal frequencies, or popularity
values, of attitude items. The control of item marginal frequencies is
achieved by obtaining an essentially 
continuous distribution of responses 
to each item by means of a graphic 
rating scale. The graphic rating scale 
is usually incorporated into a multiple- 
response type of item, an item type 
which requires more than one response 
from each respondent. 

The advantages of the technique, 
which has been named the Method of
Controlled Marginals, were discussed
in terms of (a) its potential contribution
to the problem of improving
prediction, and (b) its contribution to
the problem of improving the inter- 
pretability of scale scores. In its use 
for purposes of prediction, the MCM 
can be used to achieve the distribution 
of scores which maximizes the signifi- 
cant discriminations among respond- 
ents. In connection with the problem 
of interpretability, particular attention was devoted to the improvement
of the measurement of Guttman 
scalability, a measure of unidimen- 
sional internal consistency. The con- 
cept of basal scalability was developed. 
To compute basal scalability, it is first 
necessary to response-balance the 
items of the scale (set all item popu- 
larities equal to .50), an operation 
which guarantees the removal of all 
spurious inflation of scalability coeffi- 
cients due to imbalance of item 
popularities. Two item-to-scale in- 
dices and an item-to-item index of 
scalability were developed which are 
appropriate for use under conditions 
of response-balancing. 

The Method of Controlled Respondent Marginals was also discussed. In the MCRM, which may
be considered to be the transpose of 
the MCM, marginal frequencies are 
manipulated by respondents rather 
than by items. A response-balanced 
respondent would be one for which the 
number of plus-responses to items has 
been set equal to the number of minus- 
responses. Response-balancing all 
respondents allows observation of the 
basal similarity occurring among response patterns. This also has the
effect of maximizing qualitative inter- 
respondent differences in response 
patterns. 

The legitimacy of adjusting item 
marginals was considered, and in this 
connection, several sources of arbi- 
trariness of unadjusted marginals 
were cited. Finally, the MCM and
MCRM have been found to work out 
well in actual use in a study which 
will be reported in detail at a later 
time. 


INTERVENING CONSTRUCTS—DIMENSIONS OF
CONTROVERSY 


W. W. MEISSNER 
Woodstock College 


In recent years, an active and com- 
plex discussion has arisen over the 
methodological status of theoretical 
concepts in scientific psychology. 
The discussion has assumed note- 
worthy proportions due to the sig- 
nificance of the issues raised both on 
the level of the meaning of theoretical 
concepts and on the level of meth- 
odology. This review is intended to 
specify as much as possible the ques- 
tions raised and the lines of argument 
in which the questions are formulated. 
Once the dimensions in this complex 
field have been identified, it will at 
least be possible to determine the 
subtle influences involved and ultimately arrive at some sort of conclusion with regard to the basic methodological issues. Our aim at the
moment, however, must be confined 
to identifying and relating the rele- 
vant aspects of the discussion. 

We can date the controversy from 
Tolman’s (1932) introduction of the 
term “intervening variable” as signifying a functional relation between
the observable behavior of the organ- 
ism and certain predetermining fac- 
tors. These “initiating causes of 
behavior” were identified as
hereditary endowment, past environmental
training, the present or more or less 
recent stimuli, and the initiating 
physiological states (Tolman, 1932, 
p. 446). Tolman was careful to state 
that the intervening variable was a 
strictly logical construct which did 
not involve or depend on any experiential
content. The intervening variables
could be either molecular or
molar depending on whether the
definition was couched in neurophysiological
terms or in terms of behavior
considered as molar, i.e., as concerned 
with or related to goal objects or 
means objects. The tension between 
pure functionality and identifiability 
in terms of physiological entities and 
behavior segments was unavoidable. 
Tolman’s (1935) repeated insistence 
on the nonexperiential aspect of his 
behavioral variables was not con- 
sistently reflected even in his own 
formulations. At least, these
formulations left a certain methodological
uneasiness in their wake (Kantor, 
1957). 

In his penetrating discussion of 
theoretical techniques, Koch (1941)
pointed out that postulational and
operational techniques had to be in- 
tegrated in any meaningful kind of 
psychological theory. The postulate 
system gains empirical relevance by 
the explicit definition of relevant
terms by co-ordinating definitions.
If the operationally defined postulates 
of the formal system fit the data 
isomorphically, we can proceed inter- 
pretatively by application of formal 
deductions from the postulate system 
to the empirical domain. However, 
the isomorphic relation is hardly ever 
perfect and at times the theoretician 
is forced to proceed “telescopically”
by a progressive and simultaneous 
elaboration of the formal and em- 
pirical aspects of the system. In any 
case, the variables elaborated either 
by postulation or by empirical con- 
struction are subject to a dual crite- 
rion of logical coherence and empirical 
reference (Kaplan, 1946). 
Hull (1943) himself stressed the
necessity of anchoring intervening 
constructs to both antecedent and 
consequent conditions. This does not 
seem to differ significantly from Tol- 
man’s conception of the intervening 
variable. However, for both Hull 
and Koch, the question of empirical 
content in the construct itself is left 
open for the moment. At least, we 
can say that the intervening variables 
are functionally related to the initiating
causes or antecedent conditions of
behavior, and that they mediate the 
analysis of the organism’s behavior. 
The implicit tension in Tolman’s 
formulation has begun to manifest it- 
self more clearly. Is the intervening 
variable really purely functional, or 
is it capable of identification in neuro- 
physiological terms or even molar be- 
havioral terms? It seems that Tol- 
man and others were led to insist on 
the functional character of the inter- 
vening variable because of the relative 
inaccessibility of physiological events. 
They preferred a functionally derived 
concept to one based on hypothetical 
physiological events (Melton, 1941). 
Spence (1944) was able to contrast
Tolman and Hull's use of the intervening
variable with constructs based on
neurophysiological events, as well as
with Lewin’s response-inferred
constructs.

The ambiguity was formally recog- 
nized by MacCorquodale and Meehl 
(1948) in their distinction of interven- 
ing variable and hypothetical con- 
struct. Prior to this paper, a formal 
claim was made for the pure func- 
tionality of the intervening variable, 
but at the same time its theoretical 
embodiment was colored by empirical 
reference and identification. Mac- 
Corquodale and Meehl decided that 
the intervening variable (IV) should be 
considered as a statement, all of whose 
terms were reducible to empirical 
laws; that the validity of the empirical
laws was a necessary and sufficient
condition for the “correctness” of 
statements about it; and that its 
quantitative expression could be ob- 
tained without any mediating infer- 
ence by suitable groupings of terms in 
the quantitative empirical laws. Op- 
posed to this conception of the IV was 
that of the hypothetical construct (HC). 
The terms involved in the formulation 
of the HC were not wholly reducible 
to the terms of the empirical laws; the 
validity of the empirical laws was not 
a sufficient condition for the truth of 
the concept, since the concept in- 
volved meaning over and above that 
of its empirical referents (“surplus
meaning”); and finally, the quantita- 
tive expression of the concept could 
not be obtained by a simple grouping 
of empirical terms and functions. 
This formulation brought “surplus 
meaning” with its overtones of
existential reference into clearer focus.
Whatever confidence may be had in 
this analysis, the distinction at least 
provides a meaningful framework 
within which the various elements of 
the discussion can be situated. 

The development of the discussion 
from this point on will be organized 
under the following headings: the
IV-HC distinction, surplus meaning,
existential reference, definition of
psychological terms, reductive explanation,
the nature and function of theoretical 
construction, and finally, the problem 
of personal experience in a systematic 
psychological science. Admittedly, 
these arbitrary areas overlap and in- 
teract to a considerable extent. Thus 
it must be kept in mind that each 
division has important relevance for 
every other division. They cannot be 
isolated. 


INTERVENING VARIABLES AND 
HYPOTHETICAL CONSTRUCTS 


MacCorquodale and Meehl dis- 
tinguished the IV and HC in terms 
that implied the functional validity of
both types of concept formation in 
scientific theory construction. Even 
though the distinction was stated 
clearly enough, a certain amount of 
confusion was still detectable in sub- 
sequent discussions of the two con- 
struct-types. At one point it is 
insisted that HCs can be conceived 
only as molar neurological events
(Krech, 1950); at another point it is
stated that the IV in a scientific 
psychology is ultimately neurological 
and physiological (Hebb, 1951). In 
other contexts, the IV seems to be 
understood as any sort of symbolic 
expression which mediates between 
observable antecedent and consequent 
variables (Argyle, 1957; Koch, 1954; 
Maze, 1954). It is even stated that 
the IV is not without existential 
reference (Rozeboom, 1956). The 
confusion is manifest. Taking Mac- 
Corquodale and Meehl’s conception 
of the IV as normative, any attempt
at designating an IV in physiological
terms or at defining its existential
status means that we are dealing with
an HC of some sort. If we accept
the IV as a mathematical function, 
then we must also resign ourselves 
to the realization that mathematical 
functions cannot be found as such in 
any organism. 

Part of the confusion undoubtedly 
arises from the difficulties of classify- 
ing even sophisticated theoretical con- 
structs. MacCorquodale and Meehl 
(1948) were careful to point out that 
the IV is wont to acquire meanings 
over and above its merely functional 
formulation. Such meanings have 
been detected in a number of specific 
concepts, such as the psychoanalytic 
terms of libido, superego, and censor 
(Flew, 1956; Frenkel-Brunswik: 1954, 
1956; MacCorquodale & Meehl, 1948). 
In Hull’s case, the mathematical
formulation of the construct was
unequivocally functional, but other
formulations of the same constructs in
terms of a verbal accompaniment 
introduced a new element of signifi- 
cance (MacCorquodale & Meehl, 
1948; Maze, 1954). As Ginsberg 
(1954) expressed it, the IV does not 
denote since it is purely functional. 
But where Hull's SHR would apply, it 
acquires a denotative meaning. Berg- 
mann was not slow to see that in 
Hull’s formulation of general laws 
covering elementary situations and 
composition rules, e.g., afferent neural 
interaction, summation of habit com- 
ponents, law of incompatible response 
tendencies, he uses the IV-type
concept in a systematic way. But when
it comes to a question of the prediction
of behavior in a maze learning
situation, the IV acquires a meaning in
excess of its former systematic
definition (Bergmann, 1953). Another
such tension in Hull’s formulations
was pointed out by Koch (1954). In
the glossary of The Principles of
Behavior, Hull (1943a) defines
“amplitude” as “magnitude of intensity of a
reaction,” whereas in the
introduction of Postulate 15 (1943a), the same
concept is defined as “the amplitude
(A) of responses mediated by the
autonomic nervous system” (p. 344).

Certainly the translatability of the 
IV into an HC has been admitted 
(MacCorquodale & Meehl, 1948; 
Marx, 1951b; Rommetveit, 1955). 
Feigl (1955) has even questioned 
whether there are any pure IVs in 
psychological theory. However, the 
distinction of IV and HC is not with- 
out meaning. Granted the difficulty 
of drawing the line between them at a 
given point in the development of the 
scientific construct, certainly they 
have at least a descriptive validity as 
delimiting two distinct moments in 
the scientific and theoretical en- 
terprise. 

Question has also been raised about 
the theoretical relevance of these
construct-types. Tolman and Hull had
set up as the goal of scientific theoriz- 
ing the ultimate reduction of all 
theoretical formulations to the level 
of the purely functional IV. Their 
success in this endeavor is at best 
questionable. Following in their foot- 
steps, Skinner (1950) has expressed 
his distrust of theoretical construction 
(formation of HCs) and has opted for 
the IV as the ultimate scientific 
formulation. Feigl (1951) has labeled 
this attitude as the quest for the 
“empty organism.” The case has 
been clearly stated by Marx (1951a,
1951b). The HC is regarded as pos- 
sessing a low degree of operational 
validity involving a certain non- 
operational vagueness. The IV, on 
the other hand, has a high operational 
validity and permits a rigorous deduc- 
tion and testability, which is conson- 
ant with the demands of scientific 
method (Marx, 1951a). For this
reason, the HC should be used as a 
temporary and sometimes helpful 
stage on the road to the ultimate 
formation of scientifically rigorous 
and operationally defined IVs (Marx, 
1951b). 

The latent supposition in this atti- 
tude seems to be that there is a meth- 
odological continuity between the HC 
and the IV. The continuity is ex- 
pressed in terms of operational va- 
lidity, which may be taken as veri- 
fiability in terms of explicitly stated 
defining operations. The idea has 
been endorsed by George (1953b) in 
his claim that logical constructs (HCs) 
must be replaced by operationally 
valid IVs and that the criterion of 
transformation depends on the num- 
ber of operations involved. The HC 
becomes a term which is vaguely 
defined in common sense terms with a 
loose admixture of operational link- 
ages. As the common sense terms 
are replaced by operational definitions,
the construct gradually achieves the
status of an IV. Thus empirical 
results can give no support to the 
theoretical validity of an HC unless 
its operational validity is increased 
(Marx, 1951b). Here, we can only 
raise the question of the relation of 
this attitude to that expressed by 
MacCorquodale and Meehl. Is there 
a continuity between HC and IV on 
the grounds of operational definition, 
or does the HC constitute a distinct 
constructive moment which is quite 
different from the IV, even when the 
IV is constructed with the highest 
possible operational validity?

Not all of the discussion diverges 
from MacCorquodale and Meehl’s dis- 
tinction. Flew (1956) and Frenkel- 
Brunswik (1954, 1956) accept the dis- 
tinction and apply it to psychoanalytic 
concepts. Brunswik (1952) has in- 
terpreted the factors derived from 
factor analysis in terms of HCs, es- 
pecially when they are regarded as 
functional unities or
“faculty”-surrogates within the organism. Argyle
(1957) describes the IV as a mathe- 
matical surrogate for partial response 
components, which seems to be an 
adequate reformulation of
MacCorquodale and Meehl’s conception.
Ellis (1956) accepts the distinction 
but stresses the limited character of 
the IV and recommends the use of the 
“higher order” HC.

Other attempts have been made to 
describe the two construct-types and 
their relation. Rommetveit (1955) 
describes the HC as equivalent to the 
IV plus something extra, the
“something extra” being equivalent to
surplus meaning. Bergmann (1953)
attempts to place the ground of the
distinction in physiological terms, 
making the IV an abstract concept 
without any physiological reference, 
and equating the HC simply to a 
physiologically interpreted IV. These
formulations are undoubtedly in keeping
with the spirit of MacCorquodale
and Meehl, but their synonymity 
with the earlier formulation is not 
to be taken for granted, nor is it 
immediately evident. 

From what has already been said, 
it is not difficult to discern a certain 
dissatisfaction with the distinction of 
IV and HC. Several attempts have 
been made to reformulate or expand 
the distinction. Bertalanffy (1951), 
following Rosenblueth and Wiener 
(1945), proposed the distinction of 
material and formal models. The 
material model is an assumed hy- 
pothetical structure ultimately ob- 
servationally demonstrable (HC), and 
the formal model is an abstract, un- 
visualizable formalization of laws (IV). 
The contribution here is little more 
than terminological. In discussing 
scientific models, Zubin (1952) de- 
scribes three different types. The 
miniature model is a small-scale
reproduction of the real structure of the
organism, a blueprint. Gap-filling
models are constituted with HC-type 
constructs, which are taken as corre- 
spondences to the real structure being 
explained. Mathematical models are 
merely quantitative or qualitative ab- 
stractions of the real event which are 
used as the basis of further prediction. 
Gap-filling and mathematical models 
correspond respectively to the HC and 
IV with little distinguishable dif- 
ference. The additional miniature 
model either makes no attempt at 
description of inferred entities or proc- 
esses, or it is merely descriptive and 
not really theoretical at all. This 
formulation is thus seen to be equiva- 
lent to that of MacCorquodale and 
Meehl. 

O'Neil (1953) has suggested a
threefold distinction in types of hypothesis.
The hypothetical relation is merely a
law relating phenomena (Tolman’s
IV). The uncharacterized hypothetical
term is an assumed entity which 
vaguely accounts for an observed
relation (“trace” as some sort of alteration
of nervous tissue). The characterized 
hypothetical term accounts for the 
relation by an explicit and precise 
statement of what the assumed entity 
is (Hull’s Ds). O’Neil relates his 
characterized hypothetical term to the 
HC and the other two to the IV. The
obvious discrepancy with regard to 
the uncharacterized term is resolved 
by his insistence that if such a term 
is found to be more than a relation, it 
must be considered as an HC. Hull's
SHR, for example, is described as an 
IV by MacCorquodale and Meehl, yet 
it is certainly more than a mathe- 
matical relation. The ambiguity re- 
siding in the uncharacterized hy- 
pothetical term points up the difficulty 
in MacCorquodale and Meehl’s orig- 
inal formulation. 

More recently, Argyle (1957) has 
treated this theoretical problem in 
terms of axiomatic theories. Mac- 
Corquodale and Meehl’s HC and IV 
fall under hypothetical construct theo- 
ries whose constructs are axiomatically 
defined. Verifiable psychological laws 
are deducible from these axiomatic 
constructs. In addition he speaks of 
mechanism theories, in which a model 
is suggested whose presence in the 
organism would account for the ob- 
served laws. This supposes a formal 
similarity to the neurological structure 
and is regarded as a stage toward the 
development of hypothetical construct 
and reductive theories. The last type 
of theory is that with formal
postulates. “Theory” here is used in the
sense in which Braithwaite (1955, pp. 
89-93) uses it to distinguish it from a 
“model.” In this sense, this type of
theory constitutes a formal structure 
of postulates and formal calculus. 
The mechanistic type of theory would
then be a possible material application
or physical interpretation of the 
formal structure. It is clear that 
mechanistic models are at best a type 
of HC, but it is not clear how the 
formal theory is to be considered. In 
so far as it is structured it would seem 
to be more constructural, but we 
cannot overlook its purely formal 
character. Thus, the classification 
does not clearly reduce to the HC-IV 
scheme. The question will bear fur- 
ther consideration. 

Ginsberg (1954) has attempted to 
relate the HC-IV scheme to the con- 
cepts of “theory” and “law.” A law 
is a universal, synthetic proposition 
whose terms are IVs which are
connected in invariant association. A
theory is an invented, highly abstract
and generalized system of concepts 
(HCs) whose association with the 
observed facts is indirect rather than 
explicit. Several points should be
noted. First, it does not seem
reasonable to exclude IVs from theory, since
it is not only conceivable that mathe- 
matical relations of phenomena and 
constructed explanations of the HC- 
type could function in the same theo- 
retical context (as may well be the 
case in Hull), but many theorists 
insist on the IV as the ultimate goal 
of scientific concept formation. Sec- 
ond, if a scientific law is regarded as 
a more or less constant or invariant 
association of empirically observed 
processes, it is difficult to see the IV as 
anything other than the mathemati- 
cal statement of that constancy. If 
the law, however, states the constancy 
of relation between IVs, then we are 
either dealing with subordinated or- 
ders of IVs or the concept of law used 
in this context needs critical
assessment. With the inclusion of IVs as
possible theoretical elements, there
does not seem to be any difficulty in
conceiving “theory” as a higher-order
integration of logical constructs.

Carnap (1936) has suggested that 
the distinction of IV and HC is 
equivalent to his distinction of pure 
dispositions and theoretical terms. 
Carnap and Hempel (Feigl, 1955) 
seem to feel that the purely disposi- 
tional type of concept is open to 
direct confirmation of a definitive 
sort, whereas theoretical concepts can 
at best gain an increase of probability 
through empirical evidence. The cru- 
cial point is the constructive char- 
acter of the theoretical term in which 
regard it is equivalent to the HC. 
But the question of dispositions is not 
so easy. If the disposition is no more 
than relational, it can be regarded as 
equivalent to the IV. But it is not 
clear that it is to be so regarded 
(Carnap, 1956; Ginsberg, 1954). 
Frenkel-Brunswik (1956) seems to 
accept this equivalence since she has 
applied Carnap’s formulation to psy- 
choanalytic concepts in much the 
same way as her earlier application of 
the HC-IV scheme. The parallel is 
questionable since her conception of 
the IV allows for incomplete definition 
and surplus meaning. This concep- 
tion would seem to be closer to Carnap 
than to the original scheme. The 
problem raised here must ultimately 
be related to the question of surplus 
meaning and definition of theoretical 
constructs. At least, we can say that 
placing the question in terms of dis- 
positions reopens the issue of the 
definitional status and purely rela- 
tional character of the IV. 

Quite recently, Hilgard (1958) has 
proposed the distinction of between- 
events mediators and between-equa- 
tions mediators to supplement the 
distinction of IV and HC. In gen- 
eral, the HC is preferred as a mediator 
between events and the IV between 
equations. This distinction covers 
the same ground as the HC-IV scheme,
but in different terms. The IV can 
clearly be involved in the relation of 
events or equations. And for that 
matter, the HC can just as easily 
mediate equations as events; its 
hypothetical and constructural char- 
acter is determined by the hypothet- 
ical content it possesses in addition 
to the content of its defining opera- 
tions, whether real or mathematical. 
Hilgard himself recognizes that the 
identification of an HC as a mediator 
between events does not preclude its 
use as a quantitative mediator be- 
tween equations. The point is that 
its use as a quantitative mediator 
does not make it ipso facto an IV. 


SURPLUS MEANING


Surplus meaning has been variously 
described, but in general it can be 
taken as any meaning contained in a 
concept which is beyond that stated
in its definition or in the laws relating
it to other constructs (Maatsch & 
Behan, 1953). This immediately 
raises the question: What is to be the 
proper logical form of definition of a 
meaning which cannot be introduced 
by explicit observational definition? 
Feigl (1950) has charged that the 
phenomenalist orientation in account- 
ing for the meaning of theoretical 
constructs is inadequate. He ques- 
tions whether the analysis of theoret- 
ical concepts in terms of postulate 
systems, coordinating and opera- 
tional definitions can be upheld in 
view of scientific procedures and un- 
derlying semantical presuppositions. 
In proposing his existential hypothe- 
ses, which seem to be a type of HC, 
Feigl (1950) equates surplus meaning 
with the factual reference of theoret- 
ical constructs. The problem for 
him is one of divergence between the 
epistemic reduction of the construct
and its semantical designation. The
existential hypothesis (and for that 
matter, the HC) is in principle 
only indirectly confirmable. This 
means that the use of epistemic pre- 
dicaments will inevitably preclude any 
strict logical equivalence between in- 
directly confirmable statements and 
directly confirmable statements, and 
such indirectly confirmable statements 
must involve some sort of surplus 
meaning (Feigl: 1950a, 1950b). 

Ramsperger (1950) has questioned 
this assertion by pointing out that 
Feigl’s realist assumptions of excess
meaning of model-statements over 
observational statements and of the 
existential status of intrinsic proper- 
ties of the object are not in keeping 
with scientific procedures. His own 
contextualist view prescinds from the 
question of existence to seek the con- 
ditions under which the model would 
be discovered. There is some justi- 
fication for discounting the problem 
of existence as properly scientific, 
but we must insist (with Feigl) that 
the existence issue cannot for that 
reason be ignored. The question of 
the existential status of scientific 
constructs remains to be answered. 

A complicating issue in the discus- 
sion of surplus meaning is the ever 
present possibility of surplus meaning 
creeping in where it has been formally 
excluded. The charge has been re- 
peatedly leveled against Hull that 
many of his constructs, although 
seemingly intended as IVs, have 
wrapped themselves in the cloak of 
surplus meaning and are presently 
masquerading as HCs (Bergmann, 
1953; Koch, 1954; Rozeboom, 1956). 
The confusion generated by all this 
has led Bergmann (1953) to treat the 
HC-IV distinction as a pseudo dis- 
tinction and the issue of surplus mean- 
ing as a pseudo issue. In some quar- 
ters, the presence of surplus meaning 
is taken as an index of incomplete or
insufficient operational specification 
(George, 1953; Maatsch & Behan, 
1953; Marx: 1951a, 1951b). This 
has led many research psychologists 
to seek the elimination of surplus 
meaning as something foreign to sci- 
entific theory (Rommetveit, 1955). 
Even MacCorquodale and Meehl’s 
usage has been questioned. The 
question of the existence of an IV is 
equivalent for them to asking whether 
we have formulated correct state- 
ments of the relation of the IV 
to empirical variables. Maze (1954) 
points out that if there is a “relation 
of the IV,” the IV is then the term of 
a relation, which implies that it is 
some sort of qualitative entity and 
must involve surplus meaning. He 
concludes that the confusion of HC 
and IV is not foreign even to Mac- 
Corquodale and Meehl’s formulation. 

But where does surplus meaning
originate? Koch (1954) feels that 
the extrapostulational properties as- 
signed to constructs are either extra- 
systemic aids or else intrinsic to the 
theory. He decides rather uneasily 
that some statements, at least, are 
part of the theory, but he is unable to 
decide what their place in the theory 
is. The origin and function of sur- 
plus meaning are a corollary of the 
function of theory and its relation to 
the real structure of organisms. It 
must be kept in mind that the theo- 
retical model is at once the product of 
a cognitive process in the mind of the 
theoretical scientist and a schematiza- 
tion of reality (Rommetveit, 1955). 
In other words, the content of the HC 
or any constructed model is not totally 
determined by the results of measure- 
ment and observation. There is al- 
ways the possibility that something 
more will be contributed either in 
terms of the real structure under
investigation, since the very approach
to the organism is selective, or in
terms of the constructive process. 
There is a strong feeling that not 
only is surplus meaning unavoidable 
in theoretical construction, but it is 
the element of theory which makes it 
most fruitful and most open to the 
growth and development of scientific 
psychology (Bergmann, 1953; Feigl: 
1950a, 1950b; Koch, 1954; Rozeboom, 
1956). Surplus meaning is essential 
to theory, for there is always an in- 
determinate range of factual content 
implicit in any significant concept 
(Ginsberg, 1954; Saugstad, 1956). 
But this carries us to the related and 
in itself significant issue of existential 
reference and the function of theory. 


EXISTENTIAL REFERENCE 


The problem of existential
reference is one that concerns the HC and
not, strictly speaking, the IV, for 
there is no real problem about the 
existence of abstract mathematical 
relations. What, then, does it mean 
to say that a theoretical construct 
exists? Are the terms “real” and 
“existing” used univocally by all 
parties to the discussion, or must we 
reckon with several meanings with 
diverse connotations? Does the ques- 
tion of existential reference have any 
meaning at all for scientific constructs?

Krech (1950) views the HC as an 
actually existing structure which is 
open to direct experimental descrip- 
tion. Correlations of experimental 
conditions and results are necessary 
consequences of its assumed existence 
and intrinsic properties. There seems 
to be enough evidence to think that
Hull, despite his metatheoretical ori- 
entation, conceived some of his con- 
cepts as actually occurring (Kendler, 
1952; MacCorquodale & Meehl, 1948; 
Meissner, 1958). Koch (1954) is of 
the opinion that this is indicative of a 
certain amount of inability on the part
of Hull to adopt a determinate posi- 
tion on the reality status of his theo- 
retical constructs. The same sort of 
attitude peeks through occasionally 
in Freud (Flew, 1956; Skinner, 1956). 
Lewin’s theorizing, in terms of its 
phenomenological content, seems to 
implicitly assert the existential status 
of its constructs (Meissner, 1958; 
Spence, 1944). Examples could be 
multiplied, but the essential note for 
all these theorists is that even though 
their expressed purpose is no more 
than the formation of theory for ex- 
planatory and predictive purposes, 
their theoretical endeavor carries with 
it an assumption, more or less
verbalized, of the existence of a real
structure as conceived in the theory. Even
Hebb (1951) will speak of a “real” 
neurological model as the goal of 
psychological theory. 

On the other hand, statements
explicitly denying this sort of
identification are by no means lacking.
Krech’s position has been challenged (Kessen
& Kimble, 1952). Zubin (1952)
insists on the nonactual character of
theoretical models, since they are 
really representations which can un- 
dergo alteration to account for experi- 
mental results. The abstractive char- 
acter of theoretical models makes it 
impossible to assume an isomorphic 
relation between the terms of the 
theory and nonverbal events. The 
model is only a scientific approxima- 
tion to the structure of reality and is 
therefore limited in its capacity to 
represent that reality (George, 1953b). 
Constructs, it is pointed out, are prod- 
ucts of scientific imagination guided 
by empirical knowledge of facts. 
They therefore have only those prop- 
erties which the scientist sees fit to 
give them in a particular formulation 
of his theory (Maatsch & Behan, 
1953). 


Kantor (1957) finds part of the 
confusion arising from a common fail- 
ure to keep constructs and events dis- 
tinct. The construct arises from an 
interbehavioral relation with an event. 
The confusion arises from the neces- 
sity of dealing with complex and elu- 
sive events, and also of working often 
with contrived events. There is a 
constant methodological temptation 
to reify theoretical constructs (Kend- 
ler, 1952). In general, methodologists 
seem to be of the opinion that theoret- 
ical models are not to be taken as 
definitive representations of the real 
structure of some real entity (Meiss- 
ner, 1958; Shoben, 1955; Spence, 
1944), but rather that they are sci- 
entifically refinable approximations 
which are capable of more and more 
rigorous validation and gradually con- 
verge on the real structure of the or- 
ganism (George, 1953b). 

It is obvious from this
categorization of opinions that the terms
“reality” and “existence” assume different
meanings for different theorists in 
different contexts. Feigl (1950a) as- 
serts that surplus meaning involves an 
existential hypothesis which is either 
directly or indirectly confirmable. 
Rozeboom (1956) concurs in the di- 
rect confirmability of the HC, al- 
though Feigl (1950b) himself seems 
to shift his emphasis exclusively to 
indirect confirmation. Feigl (1950a) 
lists some of the approaches to indirect 
confirmation of existential hypothe- 
ses. The realistic positions infer that 
the referents of the HC really exist, 
but in varying senses. Physical Real- 
ism insists on the caused nature of be- 
havior, Probabilistic Realism infers 
existence from behavior in terms of 
probability, and Feigl’s own Semantic 
Realism finds that surplus meaning 
consists in the factual reference of 
constructs. Positivistic approaches 
would insist that hypotheses are
defined in terms of observable behavior
and that the “entities” of scientific
theory are, in the last analysis, purely 
formal constructs. Feigl also indi- 
cates an agnostic position which re- 
gards the hypothesis as no more than 
a useful fiction. 

While rejecting a metaphysical or 
physicalistic realism (Feigl, 1950b;
Frank, 1950), Feigl strongly urges the 
adoption of existential hypotheses in 
the sense of Semantic Realism. In 
this sense, there is nothing implied 
about the nature of the designata of
theoretical constructs since the the- 
orist is concerned only with abstract 
and formal features. Feigl (1950b) 


makes it clear that this approach to 
theory involves the repudiation and 
abandonment of a phenomenalist or 
positivist interpretation of theory, 
and allows for the inclusion of surplus 
meaning which he feels is necessarily
involved in indirectly confirmable
statements. Frank (1950), however,
takes an opposing tack in which he 
insists that all scientific statements 
have observable consequences, and 
hence any realistic terminology used 
in science must be explained opera- 
tionally. Thus, he concludes, the 
language of scientific theory must 
find Semantic Realism supplemented 
by Syntactical Positivism. Hempel 
(1950) reinforces this position by his 
claim that “factual reference” implies
that the theoretical concepts are 
linked to each other and to an eviden- 
tial base by nomological relations. 
Theoretical constructs thus assert 
more than the relevant segment of 
the evidential base, but at the same 
time can be interpreted in a phenom- 
enal analysis which has no reference 
to the semantical referents of the HC. 

In these oppositions to Feigl’s 
formulation, the common note seems 
to be that the existential reference of 
theoretical constructs is a function of
the operational status of the constructs
(Frank, 1950; Maatsch & Behan, 
1953; Maze, 1954; Skinner, 1956). 
The fundamental issue raised here is 
whether the existence of a construct 
can be split off into that of its defining 
operations. What precisely is the 
meaning of the existence of a construct 
as distinct from that of its definition? 
This purely behaviorist view has been 
endorsed by Hull (1943). But, as 
Koch (1954) has pointed out, while 
Hull speaks as if his IVs were an- 
chored to directly observable events, 
they seem rather to be linked to terms 
in the construct language of the the- 
ory which is itself reducible to ob- 
servable events. 

There is not, however, universal 
agreement that existential reference 
can be reduced to a question of op- 
erational definition. Feigl (1950a, 
1950b, 1951) insists that this is 
not the case. Certainly the position 
taken by theorists of realist persuasion 
is that statements about the model con- 
tain more meaning than the observa- 
tion sentences into which they are 
translated (Ramsperger, 1950). This 
is fundamental to the whole notion of 
surplus meaning. Confirmatory evi- 
dence, furthermore, does not add ex- 
istential meaning in any real sense, 
since strictly speaking the relation of 
evidence to hypothesis is logical, and 
only that (Ginsberg, 1954). As Berg- 
mann (1951) suggests, there is no 
specific sense in which any defined 
concept refers to an “entity” whose
“existence” is hypothesized. What is
operationally defined is not construc- 
ted. And we might add, it is in con- 
struction that the issues of surplus 
meaning and existential reference arise, 
not in definition. That there is confu- 
sion in this area of consideration is 
clear when authors can assert that the 
reality status of a concept depends on 
its operational definition, and at the 
same time that surplus meaning arises
from the assumption of reality and the 
use of nonoperational terms (Maatsch 
& Behan, 1953). It is very likely 
this sort of confusion that brings the- 
orists to reject considerations of exist- 
ence as spurious and to emphasize 
the predictive and control aspects of 
theory (Ramsperger, 1950). Nagel 
(1950) in fact has rejected any privi- 
leged position for Semantic Realism on 
the ground that questions about the 
factual reference of theoretical con- 
structs are misleading, since there 
is a temptation to look for answers 
not within the context of inquiry of 
the construct, but within the frame- 
work of assumed conceptions about 
what the construct ought to do. 
There seems to be a need to distin- 
guish between the operational meaning 
of an intervening construct and the 
intuitive properties attributed to it 
(Kendler, 1952). George (1953a, 
1953b) has distinguished two major 
orientations toward this problem of 
bridging the gap between the level of 
experience or evidence and the level 
of theoretical construction. Prag- 
matic Realism starts with an assumed, 
really existent world which must be 
observed and mapped. Positivism 
or Operationism, on the other hand, 
starts with experience and observation 
and builds up a theoretical model 
consisting of observational properties. 
In George’s view, the operation can 
function as the limiting test of an
otherwise realistic model, such that 
Operationism and Pragmatic Realism 
can be thought of as complementary 
moments in the development of an 
explanatory model. The operational 
definition provides a criterion for 
validity of the realistic model (George, 
1953a). His example of the implicit 
acceptance of a mixed operational- 
realistic model is Hull’s linking of
reinforcement to SER and not SHR, which
involves a distinction between per- 
formance and learning. The sugges- 
tion is one that merits consideration, 
but ultimately one must ask the ques- 
tion: Is reality attributed within the 
compass of the operational approach, 
or is it nonoperational or even
contraoperational?

The different uses and meanings of 
the terms “real” and “exist” call for
some clarification. Beck (1950) has
pointed out two senses in which
constructs are said to “exist.” Systemic
existence refers to the analytic conson- 
ance of a construct within a system of 
constructural entities. Real existence 
refers to the possibility of the construct 
functioning as the subject of synthetic 
propositions. In this context, he
refers to theoretical concepts with
“supposed real existence” as inferred
entities, and to concepts with systemic
existence as constructs. Elsewhere 
(Meissner, 1958) I have attempted to 
show that there is room for further 
distinction in the senses of “existence”
as it is found in psychological theory. 
In terms of this current synthesis, we 
should recognize a “real existence”
which is predicated directly of the 
object or organism as a description of 
the real, objective, noninferred struc- 
ture of the organism. This is akin to 
what Feigl termed “naive physical
realism” and in scientific theory would
seem to be illegitimate and un- 
grounded (Feigl, 1950a; Meissner, 
1958). In addition there is a
“theoretical-real existence” which is very
close to Lindzey’s (1953) notion of 
conventional construct and which 
amounts to a legitimate methodologi- 
cal assumption and implies nothing 
about the real organism. This kind 
of existence arises by reason of the 
very fact that the construct involves 
some sort of theoretically conceived 
structure (Kessen & Kimble, 1952; 
Maze, 1954; Meissner, 1958). It is 
impossible to talk about a structure
without thinking of it as existent. 
Whether it is or not is another 
question. 


DEFINITION 


Discussions impinging on questions 
of surplus meaning and existential 
reference have brought out the prob- 
lematic fact that the HC is very often 
not completely equatable to its
defining operations (Feigl, 1950a, 1950b,
1951; Koch, 1954; Krech, 1950; 
Northrop, 1947; Ramsperger, 1950). 
Usually in a strictly behavioristic 
framework, the norm of definition re- 
quires that antecedent and consequent 
variables be explicitly linked to ob- 
servables and the defining relations of 
the intervening construct to each set 
of variables be stated (Feigl, 1951; 
Hull, 1943b; Koch, 1954; Miller, 
1951). Such a rigorous criterion is 
rarely, if ever, attained and conse- 
quently many theorists have offered 
mitigated criteria, even when the 
ideal of complete reduction to ob- 
servables has been retained (Carnap, 
1956; Koch, 1954). There seems to 
be room for some sort of partial inter- 
pretation of theoretical terms (Car- 
nap, 1956; Ellis, 1956; Miller, 1951). 
In relation to testing procedures, 
Cronbach and Meehl (1956) have sug- 
gested that test constructs are not 
operationally defined, or at least not 
completely so, and Krech (1950) has 
argued that in a purely psychological 
context, HCs have suffered an ambigu- 
ity of definition. Koch (1954) has 
pointed out that even in Hull's highly 
formalized system, not everything is in
order. Even though it is claimed that 
IVs are exclusively defined by stated 
relations to empirical variables, Koch 
is able to distinguish three types of 
IVs: (a) IVs directly linked to inde- 
pendent variables (afferent neural 
impulse, drive), (b) IVs indirectly
related to independent variables,
mediated by prior relations to other 
IVs (SHR, SER), and (c) IVs which 
have no relation at all to independent 
variables (SLR, SOR). This last 
group of IVs is at variance with
Hull’s metatheoretical conception of 
the nature of the IV since they are not 
linked, either programmatically or 
actually, to any observable antece- 
dent conditions (Koch, 1954). Other 
constructs in the system acquire prop- 
erties beyond those assigned in the 
defining postulates, e.g., SHR and 
SER. There is also confusion in 
terms of operational meaning, since 
his view of the IV precludes direct 
operational definitions. His IV is 
only indirectly definable in terms of 
the operational definitions of inde- 
pendent and dependent variables. 

All this (and more) has led Feigl 
(1950a) to question whether the con- 
ception of scientific theory as linking 
a postulational system to the empiri- 
cal realm through co-ordinating and 
operational definitions is at all ade- 
quate. A fresh orientation to the 
problem was given by Carnap’s (1936) 
criticism of operational formulations. 
Carnap reset the problem by his 
proposal of “reduction sentences” to 
account for the specification of 
the theoretical term. However, this
specification is partial in the sense that
the reduction sentence $Cx \supset (Qx \equiv Ex)$, where
the experimental results E provide evidence for
the support of the theoretical term or 
construct Q, implies that Q is defined 
only under the specified conditions C. 
This leaves the theoretical concept 
open to different and supplementary 
operational criteria, and to different 
and supplementary reduction sen- 
tences in different experimental con- 
texts (Carnap, 1956; Feigl, 1955; 
Miller, 1951). Two points are im- 
portant here. First, the definition of 
Q is only partial in terms of the
correspondence rules linking it to E. If
E is confirmed under C, the presence 
of Q is no more than probable. Sec- 
ond, the use of reduction sentences 
does not affect the problem of the 
meaning and definition of theoretical 
terms (in Carnap’s sense) which are 
not explicitly defined in observational 
language but are introduced by postu- 
late and, to that extent, are not com- 
pletely interpreted. This raises the 
very important question: Is it possible
to obtain complete definition of
theoretical terms by means of an
observational vocabulary?
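To make the partiality of a bilateral reduction sentence concrete, the following Python sketch (an illustration added here, not part of Carnap's or the author's apparatus) enumerates truth values for C, E and Q for a single object x and keeps only the assignments consistent with Cx ⊃ (Qx ≡ Ex). It assumes plain two-valued logic, and the function names `reduction_sentence` and `admissible_q` are hypothetical. It captures only the logical point that Q is fixed by E when C holds and left open when C fails; it does not model the probabilistic character of confirmation noted above.

```python
from itertools import product


def reduction_sentence(c: bool, q: bool, e: bool) -> bool:
    """Truth value of the bilateral reduction sentence C -> (Q <-> E) for one object x."""
    return (not c) or (q == e)


def admissible_q(c: bool, e: bool) -> set:
    """Values of the construct Q that are logically consistent with the
    reduction sentence, given the test condition C and the result E."""
    return {q for q in (True, False) if reduction_sentence(c, q, e)}


if __name__ == "__main__":
    # One line per combination of test condition C and experimental result E.
    for c, e in product((True, False), repeat=2):
        print(f"C={c!s:<5} E={e!s:<5} admissible Q: {sorted(admissible_q(c, e))}")
```

Run as a script, it prints four lines; in the two rows where C is false, both truth values of Q survive, which is the sense in which the reduction sentence leaves Q undetermined outside its test conditions.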

In a recent probing of the possibilities
of answering this question,
Hempel (1958) has concluded that
reduction sentences do not achieve the
full definition of theoretical terms in
observational language, since they do
not allow all the primitive terms of the
specific theoretical vocabulary to be
stated in operational terms. In any
case, no matter how the restatement of
theoretical terms is attempted in an 
observational vocabulary, the de- 
finability of a given theoretical term 
will concern “meanings” of both the 
defined term and the defining set. 
Hempel (1958) seems to make an act 
of faith in the ultimate adequacy of
extensional logic, but the issue of in- 
tentionality and meaning must be 
faced. Hempel’s argument deserves 
careful evaluation and study, for he 
may well have brought the argument 
to the point where extensional logic 
must give way, at least, to intensional 
supplementation, if that is possible. 
Hempel concludes his investigation 
with: “And as far as suitability for 
inductive systematization, along with 
economy and heuristic fertility, are 
considered essential characteristics of 
a scientific theory, theoretical terms 
can not be replaced without serious 
loss by formulations in terms of
observables only: the theoretician’s
dilemma, whose conclusion asserts the
contrary, starts with a false premise”
(1958, p. 96). The ultimate question,
I would suggest, may well be whether
the false premise is only that of the
desirability of observational restate- 
ment of theoretical terms, or whether 
the false premise lies at a deeper level 
of theoretical presumptions and 
methodological commitments. 


LEVEL OF EXPLANATION 


The problem of definition is one 
that arises in determining the precise 
relations that obtain between the ex- 
planatory construct and the empirical 
observations or empirical events that 
serve as the construct’s inductive 
basis and confirmatory evidence. 
When the relations of construct and 
event have been specified, the further 
question presents itself: At what point 
in his theoretical endeavor can the 
theorist feel that his formulation is an 
adequate explanation of the empirical 
evidences? Two comments can be 
made here by way of orientation. 
First, the question here is one of the 
“adequacy” of a given explanation. 
“Adequacy” is a relative term, so that 
our concern is not with theory as 
definitive formulation. Formulation 
of scientific theory may well never be 
definitive. Second, the issue is
precisely the adequacy of a given
explanation as a theoretical event. The
theoretician’s judgment of adequacy 
must rest on certain specific criteria: 
incorporation of evidence, internal 
consistency, degree of confirmation, 
degree of control, etc. These criteria 
will be necessarily weighted in terms 
of the nature of the evidences in- 
volved, theoretical concerns, systema- 
tic and extrasystematic objectives, 
current stage of theory development, 
and a host of other influences. But 
theory must always function as such 
in terms of empirical events of
whatever kind they may be.

The level of explanation can be set 
at a psychological, physiological or 
physical level. Tolman (1932) ex- 
pressed the distinction of psychologi- 
cal and physiological explanation in 
terms of the “behavior-readinesses”
and “immanent determinants” of the
molar, psychological level as opposed 
to the neurological and glandular 
terminology of physiological explana- 
tions. The psychological explanation 
deals with its own proper
“psychological” evidences which are excluded as
proper evidences of the physiological 
explanation. For Tolman (1935), the 
two levels are correlated and comple- 
mentary, each with its own evidences, 
and together they provide an ade- 
quate account of behavior. The ques- 
tion, however, is: Can we do without 
one or the other? 

Some theorists have insisted on the
necessity of a physiological reduction
(Ellis, 1956; Hebb, 1951; Krech,
1950; MacCorquodale & Meehl, 1948; 
Richfield, 1956). O’Neil’s (1953) 
characterized term would seem to be 
physiologically conceived, and Flew 
(1956) has argued that even
psychoanalytic concepts are in principle
reducible to physiological terms.
Although Skinner’s (1956) views have
struck some as sceptical (Ellis, 1956),
he has certainly indicated that he con- 
ceives physiological reference as an 
ultimate theoretical criterion. Re- 
flecting a certain amount of termino- 
logical confusion, both the HC and the 
IV are spoken of in physiological 
terms (Argyle, 1957; Bergmann, 1953;
Ellis, 1956; Flew, 1956; Hebb, 1951; 
O’Neil, 1953). The purely functional 
character of the IV must be kept in 
mind in this discussion. It can func- 
tion in relating events described at 
any level, but the IV itself is neither 
psychological nor physiological. 


The tendency to seek physiological 
referents for psychological terms has 
not gone unquestioned. Krech’s phys- 
iological emphasis has been opposed 
on the grounds that physiological 
reference involves an anachronistic 
reality concern (Kessen & Kimble, 
1952) and Sellars (1956) has argued 
that behaviorism is by no means
committed to physiological identification
of all its concepts. We should not 
allow the biological preoccupations of 
some theorists to distract us from 
agreement of theoretically derived 
statements with independently col- 
lected empirical data as a primary 
criterion of theoretical success (Lind- 
zey, 1953). There is no need to 
abandon all theory to the physiologists 
(Stafford, 1954). 

The predominant issue in this part 
of the discussion is clearly that of the 
logical and scientific status of physi- 
ological reduction and physiological 
behavior models in psychology (Berg- 
mann, 1953). Certainly reduction- 
ism has been roundly condemned. In 
terms of physical models, Buck (1956) 
has insisted that physical terms apply 
to man only in so far as he is a physical 
being with a physical structure, but 
that the solution of conceptual diffi- 
culties in psychology does not lie in 
the direction of abandoning psychology 
for physics. Physical models, like 
the telephone model of information 
theory, do not adequately describe or 
explain the psychological phenomenon. 
The telephone model is not really 
equivalent to the communication be- 
tween persons. MacKay (1954) has 
also argued convincingly against any 
kind of reductive account. He high- 
lights what he feels to be the crucial 
error of reductionism by the example 
of an electric sign which is identified 
reductively as “nothing but an array
of lamps and wires,” with the result
that we draw the conclusion that there 
is no advertisement at all. The error
here is patent. The question is 
whether or not it is a valid or fair 
criticism of reductive attempts in
psychology. 

One does not have to deal with 
psychological theory very long before 
he is able to attain some level of 
sympathy with the earnest quest of 
the psychological theorist for uni- 
formity and nomotheticity. Krech's 
(1955) defense of his physiological 
claims rests on the grounds that uni- 
formity of theory lies ultimately in 
the physiological direction, and not in 
the molar “psychological” direction.
Reduction, then, is in reality a device 
for achieving methodological and sys- 
tematic uniformity. If such uniform- 
ity is desirable, then reductive pro- 
cedures are defensible. But we must 
remember that reduction to lower 
levels of explanation, whether
physiological or physical, and the resulting
nomothetic uniformity have the in- 
escapable consequences of transition 
of evidences and organismic
encapsulation (Brunswik, 1952, 1955). As
long as we remember that any theo- 
retical structure is an explanation of a 
limited and selective set of events, 
that a fair appraisal of the theory 
must be directed to evaluating its 
relation to these limited events, and 
that the theory is restrictive in the 
sense of gaining partial accountability 
at only one level in what is actually a 
multi-dimensional analysis, we can 
clarify the real status of reductive 
methodologies. It can also be said, 
and has been said (Postman, 1955), 
that nomothesis does not depend on 
reduction. The real question here 
would seem to be: Can any kind of
nomothetic theory be established
which will be thoroughly psychological
and account for the totality of
evidence, and at the same time avoid
reductive implications?
Carnap’s (1932, 1935) attempt to
reformulate the concepts of psychology
in physicalistic terms presents
very much the same problem. There
is no question of reducing psychological
laws to physical laws or of a converse
deduction. The issue is linguistic
(Bergmann, 1940a, 1940b).
Carnap asserts that every term of the
language, not only of psychology, but
of all science, is reducible or translatable
to terms of physical language.
His formulation for psychology has
been challenged (Smedslund, 1955).
Ultimately the fate of physicalism
rests on the capacity of the theoretical
terms to be reduced to the terms of
the thing-language. That this is
even possible remains an open question,
as I believe Hempel has shown
(1958). In any case, Carnap’s notion
must meet the challenge of excluded
evidences, which it has not done and
may not be able to do.

Tolman had originally spoken of
“molar” and “molecular” levels.
The terms have become standard
usage in this part of the discussion
and are still current (Feigl, 1950b;
Koch, 1954; Littman & Rosen, 1950).
The more recent usage of these
correlative terms has been examined and
seems to have become multidetermined
(Littman & Rosen, 1950).
The terms have been used here in
reference to levels of explanation as
equivalent to “psychological” and
“physiological.” But they are often
used with different implications:
interaction-isolation, genotypic-phenotypic,
phenomenal-analytic, and holistic-atomistic
are some of the contextually
polarized meanings of “molar-molecular.”
Thus the term “physiological” could
easily refer to the molecular
level in terms of explanation levels
and at the same time to the molar
level in terms of interaction. Caution
is needed.


THE NATURE OF THEORY


Practically speaking, most of the 
issues we have discussed up to this 
point are predetermined by the res- 
pective theoretical orientations of 
the discussants. There are basically 
three distinctive orientations which 
deserve mention. At one extreme lies 
the correlationist attitude which, in 
terms of our previous discussion, 
would insist on the scientific validity 
of the IV, on rigorous operationalism 
and completeness of definition, on 
elimination of surplus meaning and 
overtones of existential reference. 
For this school the IV becomes mean- 
ingful only as a symbolic construct or 
a functional and economic expression 
of observables and their interrela- 
tions. Skinner (1950) has espoused 
the position, although there seems to 
be some doubt as to his adherence to it 
(Bergmann, 1953). Marx (1951a,
1951b) has clearly enunciated the is- 
sues relative to psychological method- 
ology, and even Hull and Tolman 
have been classed as correlationists 
(Feigl, 1951), at least in regard to 
their metatheoretical intentions. 

The second theoretical attitude is 
that of conventionalism, which I
mention here as distinct because it
professedly involves an element of
postulational method which sets it apart
from mere correlationism. They are 
often classed together because they 
agree in rejecting any positing of un- 
observable entities. The theorists in- 
volved in merely correlational pro- 
cedures may also participate in the 
conventional outlook, but correlation 
and convention are two different 
things. The distinction deserves in- 
vestigation, but it has only secondary 
importance to our present concern. 

The opposite extreme to correla- 
tionism is constructuralism, which 
places its emphasis on the value of 


the construct. In terms of our pre- 
vious analysis, the constructuralist 
stresses the place of the HC in theory, 
accepts surplus meaning as fruitful 
and desirable, views his theoretical 
constructs as somehow involving ex- 
istential reference (in various senses), 
and recognizes that operational defi- 
nition is never really complete and 
perhaps can never be so.
MacCorquodale and Meehl (1948) seemed to
take this orientation in their original 
formulation and those methodologists 
who follow them in the acceptance of 
some form of HC with surplus mean- 
ing and rejection of operational rigor- 
ism would be of the same persuasion. 
Krech’s (1950, 1955) physiological 
emphasis, the appeal to a modified 
operationalism on the part of George
(1953a, 1953b), and Ellis’ (1956) and 
Feigl’s (1950a, 1950b, 1955) insistence 
on existential hypotheses, as well as 
the resistance to behaviorist reduction 
on the part of psychoanalytic method- 
ologists (Frenkel-Brunswik, 1956; 
Richfield, 1956), can all be readily 
understood in this light. Some form 
of constructuralism would seem to be 
the predominant attitude among psy- 
chological theorists, and it will be 
readily seen, in the light of our former 
discussion, that many who profess a 
theoretical correlationism do not carry 
it out in practice. 

There is no need to conceive the 
two kinds of orientation as exclusive. 
Ginsberg’s (1954) discussion of law 
and theory is an attempt to incorpo- 
rate both correlationist (law) and 
constructuralist (theory) views. 
Argyle (1957) discusses them in terms 
of IV-type and HC-type theories.
The IV-type coordinates empirical
laws but does not relate generaliza- 
tions deductively. He implies by 
this that the method is nonaxiomatic. 
This would seem to correspond to our 
distinction of correlationism and
conventionalism. Generalizations are,
however, related deductively in HC- 
type theories. Prediction in IV-type 
theories is limited, according to Ar- 
gyle, to a kind of extrapolation, 
whereas HC-prediction is more power- 
ful. This is an interesting emphasis, 
since those who take the correlation- 
ist’s view usually have prediction as 
opposed to explanation as their ob- 
jective (Seward, 1955; Stafford, 1954). 
They are concerned with the how of 
behavior, not the why (Scriven, 1954). 
The puzzling matter is whether the 
HC with all its difficulties actually 
yields better prediction than its meth- 
odologically chaste sister, the IV. 
Perhaps we ought to decide what we 
mean by prediction, and why we say 
that one prediction is better or 
stronger than another. We could ask 
the same question about “adequate
explanation.” The answer must be
given in terms of theoretical orienta- 
tion. The same attitude toward 
types of theories as levels of explana- 
tion is reflected in Meadow’s (1957) 
discussion of iconic and correlational
symbols. Each type of model has
its own proper use, and, moreover,
the adequacy of a given model is 
partly a function of the level of sym- 
bolization of that model. 

There is often an exaggeration of 
constructuralism which manifests it- 
self as an acceptance of the theoretical 
constructs themselves as the ultimate 
reality or as an uncritical and over- 
zealous dependence on the construct 
as an exact representation of the real 
structure (Lindzey, 1953; Meissner, 
1958). Sometimes too the construc- 
tural dimension of theory is treated as 
if the construct were formed without 
any empirical limitation or regulation 
at all. Statements to the effect that 
models can be created without refer- 
ence to real phenomena (Miller, 1951) 
can thus be misleading. It is clear 


that the construct or model is not 
the actual organism (Kantor, 1957; 
Meissner, 1958; Zubin, 1952), but 
rather that it is a sort of alterable 
representation (Zubin, 1952), a re- 
search tool which is experimentally 
useful and can be replaced or reformu- 
lated in terms of subsequent results 
(Dallenbach, 1953; MacKay, 1954). 
MacKay (1954) has described the 
model as a kind of adjustable tem- 
plate which is held up to the real sub- 
ject of investigation so that the dis- 
crepancies between it and the real 
structure will yield new information 
and suggest new directions of investi- 
gation. Discrepancies between model 
and organism, between prediction 
and fact, will result in revisions of the 
model in regard to semantic rules or 
model syntax for the purpose of 
achieving greater isomorphism be- 
tween model and event. It may also 
lead to a delimitation of the predictive 
field (Rommetveit, 1955). Certainly 
the scientific model is the product of 
theoretical imagination, but it is also 
true that the scientific imagination is 
an imagination guided by and de- 
pendent on empirical facts (Kantor, 
1957; Maatsch & Behan, 1953). In 
short, the model is an analogy (and a 
partial and limited one at that) which 
expresses certain aspects of the subject 
of investigation. Ultimately the the- 
oretical-constructural model is related 
to the real structure of its subject, 
but the relation is mediated through a 
complex set of selective processes, 
measurement techniques and meth- 
odological procedures (Meissner, 
1958). Within this mediational proc- 
ess, room must be made for the norm- 
ative status of real events and the 
event-guided, but not event-deter- 
mined, constructive activity of the 
scientist. Thus, while it is true that 
there is no restriction on the kind of 
construct which is proper to
psychological methodology (Kessen & Kimble,
1952), it is also true that the
primary rule of theory formation is
that it give an adequate account of
empirically discoverable events. To
understand theory in any other way
is to destroy the function of theory.
As Spence (1957) has suggested, model
construction is neither possible nor
meaningful unless there is a body of
information and experimentally
determined laws, prior to the theory,
which the theoretical model can
explain. It is precisely this that raises
the perennial dilemma of scientific
theory: the paradox of exactitude at
the price of fullness, validity at the
cost of generality, the constant and
recurring tension of the reality
beyond the theory and the reality
formulated within the theory (Beshers,
1957).


PERSONAL EXPERIENCE


Within the framework of behavi- 
orism, experiential evidences fall un- 
der the heading of surplus meanings 
and have a peculiarly intimate relation 
to the question of existential reference. 
Earlier treatments after the manner of
Tolman (1932, 1935, 1936) tended to
confuse experiential or phenomenological
evidence with the molar behavioral
level; this resulted in the
ambiguities and confusions which we
have already discussed. Other
behaviorists leave little or no room for
the private datum of experience.
Either psychology proceeds behavioristically
and is scientific, or it includes
private experience and forfeits its
claim to scientific status (Conway,
1955). At best, it is admitted as
having a certain pretheoretical value
or perhaps merely descriptive value
for a “purely psychological” psychology
(Krech, 1950). However, it is
not to be dismissed so easily.
(1955) has raised the problem of a
phenomenological psychology, and
although his attempt, in conjunction
with Combs (1949), at systematization
was severely criticized (Smith,
1950) for its insufficient account of
unconscious motivation, the attempt 
at least serves notice on a beleaguered 
behaviorism that there is another 
challenger in the field. 

The problem originates with the striking inability of the behaviorist orientation, strictly conceived, to make its account of behavior meaningful and relevant to the existential human situation. Furthermore, it is not entirely clear that the phenomenological dimension is as completely excluded from behavioristic accounts as we are led to believe. Spence (1944) has pointed out that Lewin's field theory is heavily phenomenological, and Krech (1950) has suggested that tensions, needs, and cognitive structures are sometimes defined in terms of conscious experience. It appears that the formal systematic constructs of behavioristic systems are often subjected to a subtle reinterpretation in phenomenological terms when it becomes necessary to bring these terms and constructs to bear in the human context. Even Hull's formulations seem to be subject to this sort of phenomenological reformulation and reinterpretation (Meissner, 1958).

It is not clear that the phenomenological need remain separated from the behavioristic. Strawson (1958) has recently conducted an examination of the person as experiencing subject. He finds that the concept of person is unique in that it incorporates both behavioristic and conscious elements, and that the concept is untenable without both these elements. The concept of person is in a strict sense primitive, in that its meaning includes, necessarily and without priority, both conscious and behavioristic elements. This conceptualization, if it is valid, breaks through the dichotomy of consciousness and behavior, for it forces us to deal with a new level of conceptual focus, that of the person.

Another interesting attack has been opened on this front by Karl Zener (1958). He argues that the intersubjective criterion is not essential for scientific observation. The "objectivity" of science requires that, under specific and manipulable conditions, repeatable events (in this case, conscious experiences) can be shown to recur. The criterion is the repeatability of obtained functional relations between specific experiences and specific conditions, rather than their public character. Thus, even though the technical difficulties are extremely complex, this approach opens up the possibility of setting the personal datum on a scientifically stable and acceptable footing. The possibilities opened by Strawson and Zener show that it is at least conceivable that the framework of consideration characteristic of scientific psychology is excessively limited in scope and is in need of radical reassessment or complementation. If the phenomenological element asserts itself in our most rigorous behavioristic systems (Meissner, 1958), and the possibility is open for its systematic incorporation, we can no longer afford to ignore this demand. All this, however, remains as yet in the realm of suggestion.


CONCLUSION


I have tried to raise, in their proper contexts and in terms which would clarify the dynamics involved, several interrelated problems which derive from the general question of intervening constructs. These problems may well be transitional, in the sense that the immature level of development within present methodological orientations will mature with time into a methodologically complete behaviorism. Psychology is, after all, as Spence (1957) has remarked, in an early stage of growth. But the signs which we have tried to highlight would suggest that we cannot afford to rest easy with this sort of methodological act of faith, but must explore the evidences that press themselves upon us. It is especially important to recognize the multiple influences that act to determine the course of argument from the outside, as it were. Ultimately, many of the questions raised here will find their solution not only in and through the analysis of theory and metatheory, but on the level of the philosophic presuppositions that are active in almost every phase of the discussion.